Abstract

We consider the classic Set Cover problem in the data stream model. For $n$ elements and $m$
sets ($m \ge n$) we give an $O(1/\delta)$-pass algorithm with strongly sub-linear $\tilde{O}(mn^{\delta})$ space and
logarithmic approximation factor. This yields a significant improvement over the earlier algorithm
of Demaine et al. [DIMV14], which uses an exponentially larger number of passes. We complement
this result by showing that the tradeoff between the number of passes and space exhibited by
our algorithm is tight, at least when the approximation factor is equal to 1. Specifically, we
show that any algorithm that computes set cover exactly using $(\frac{1}{2\delta} - 1)$ passes must use
$\tilde{\Omega}(mn^{\delta})$ space in the regime of $m = O(n)$. Furthermore, we consider the problem in the geometric
setting where the elements are points in $\mathbb{R}^2$ and sets are either discs, axis-parallel rectangles,
or fat triangles in the plane, and show that our algorithm (with a slight modification) uses the
optimal $\tilde{O}(n)$ space to find a logarithmic approximation in $O(1/\delta)$ passes.

Finally, we show that any randomized one-pass algorithm that distinguishes between covers
of size 2 and 3 must use a linear (i.e., $\Omega(mn)$) amount of space. This is the first result showing
that a randomized, approximate algorithm cannot achieve a space bound that is sublinear in
the input size.

This indicates that using multiple passes might be necessary in order to achieve sub-linear 
space bounds for this problem while guaranteeing small approximation factors. 


1 Introduction 

The Set Cover problem is a classic combinatorial optimization task. Given a ground set of $n$
elements $\mathcal{U} = \{e_1, \ldots, e_n\}$, and a family of $m$ sets $\mathcal{F} = \{r_1, \ldots, r_m\}$ where $m \ge n$, the goal is
to select a subset $X \subseteq \mathcal{F}$ such that $X$ covers $\mathcal{U}$, i.e., $\mathcal{U} \subseteq \bigcup_{r \in X} r$, and the number of sets in
$X$ is as small as possible. Set Cover is a well-studied problem with applications in many areas,
including operations research [GW97], information retrieval and data mining [SG09], web host
analysis [CKT10], and many others.

Although the problem of finding an optimal solution is NP-hard, a natural greedy algorithm
which iteratively picks the “best” remaining set is widely used. The algorithm often finds solutions
that are very close to optimal. Unfortunately, due to its sequential nature, this algorithm does
not scale very well to massive data sets (e.g., see Cormode et al. [CKW10] for an experimental
evaluation). This difficulty has motivated a considerable research effort whose goal was to design
algorithms that are capable of handling large data efficiently on modern architectures. Of particular
interest are data stream algorithms, which compute the solution using only a small number of
sequential passes over the data and a limited amount of memory. In the streaming Set Cover problem [SG09],
the set of elements $\mathcal{U}$ is stored in memory in advance; the sets $r_1, \ldots, r_m$ are stored consecutively
in a read-only repository, and an algorithm can access the sets only by performing sequential scans
of the repository. However, the amount of read-write memory available to the algorithm is limited,
and is smaller than the input size (which could be as large as $mn$). The objective is to design
an efficient approximation algorithm for the Set Cover problem that performs few passes over the
data, and uses as little memory as possible.
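
To make the access model concrete, the following is a minimal Python sketch of the greedy algorithm run as a multi-pass streaming algorithm: each round is one sequential scan, and only the uncovered elements plus the best set seen so far are kept in memory. The function name and the list-based "repository" are illustrative assumptions, not part of the paper.

```python
from typing import List, Set


def greedy_set_cover_multipass(universe: Set[int], stream: List[Set[int]]) -> List[int]:
    """Greedy Set Cover in which every round is one sequential pass over `stream`.

    Hypothetical helper: `stream` stands in for the read-only repository and is
    only ever scanned front to back. Working memory is O(n).
    """
    uncovered = set(universe)
    chosen: List[int] = []
    while uncovered:
        best_idx, best_elems = -1, set()
        for idx, s in enumerate(stream):      # one pass over the repository
            gain = s & uncovered
            if len(gain) > len(best_elems):
                best_idx, best_elems = idx, gain
        if best_idx < 0:
            raise ValueError("the sets do not cover the universe")
        chosen.append(best_idx)
        uncovered -= best_elems               # update the yet-uncovered elements
    return chosen
```

Each round covers at least one new element, so this variant makes at most $n$ passes, which is far too many for the streaming setting considered here.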

The last few years have witnessed a rapid development of new streaming algorithms for the
Set Cover problem, in both the theory and applied communities, see [SG09, CKW10, KMVV13, ER14,
DIMV14, CW16]. Figure 1.1 presents the approximation and space bounds achieved by those
algorithms, as well as the lower bounds.

Related work. The semi-streaming Set Cover problem was first studied by Saha and Getoor
[SG09]. Their result for the Max $k$-Cover problem implies an $O(\log n)$-pass $O(\log n)$-approximation
algorithm for the Set Cover problem that uses $\tilde{O}(n^2)$ space. Adopting the standard greedy algorithm of
Set Cover with a thresholding technique leads to an $O(\log n)$-pass $O(\log n)$-approximation using $\tilde{O}(n)$
space. In the $\tilde{O}(n)$ space regime, Emek and Rosén studied designing one-pass streaming algorithms for
the Set Cover problem [ER14] and gave a deterministic greedy-based $O(\sqrt{n})$-approximation for the
problem. Moreover they proved that their algorithm is tight, even for randomized algorithms. The
lower/upper bound results of [ER14] applied also to a generalization of the Set Cover problem, the
$\varepsilon$-Partial Set Cover($\mathcal{U}, \mathcal{F}$) problem, in which the goal is to cover a $(1 - \varepsilon)$ fraction of the elements of $\mathcal{U}$ and the
size of the solution is compared to the size of an optimal cover of Set Cover($\mathcal{U}, \mathcal{F}$). Very recently,
Chakrabarti and Wirth extended the result of [ER14] and gave a trade-off streaming algorithm
for the Set Cover problem in multiple passes [CW16]. They gave a deterministic algorithm with $p$
passes over the data stream that returns a $(p + 1)n^{1/(p+1)}$-approximate solution of the Set Cover
problem in $\tilde{O}(n)$ space. Moreover they proved that achieving $0.99\,n^{1/(p+1)}/(p + 1)^2$ in $p$ passes using
$\tilde{O}(n)$ space is not possible even for randomized protocols, which shows that their algorithm is tight
up to a factor of $(p + 1)^3$. Their result also works for the $\varepsilon$-Partial Set Cover problem.
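
The tightness factor quoted for [CW16] can be checked by dividing the upper bound on the approximation ratio by the lower bound. The short calculation below is only a sanity check of that arithmetic, taking the two bounds as stated in the text.

```latex
\[
\frac{(p+1)\,n^{1/(p+1)}}{0.99\,n^{1/(p+1)}/(p+1)^{2}}
  \;=\; \frac{(p+1)^{3}}{0.99}
  \;=\; O\!\left((p+1)^{3}\right).
\]
```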

In a different regime, which was first studied by Demaine et al., the goal is to design “low”
approximation algorithms (depending on the computational model, it could be $O(\log n)$ or $O(1)$) in
the smallest possible space [DIMV14]. They proved that any constant-pass deterministic $(\log n/2)$-approximation
algorithm for the Set Cover problem requires $\tilde{\Omega}(mn)$ space. This shows that unlike
the results in the $\tilde{O}(n)$-space regime, to obtain a sublinear “low” approximation streaming algorithm
for the Set Cover problem in a constant number of passes, using randomness is necessary. Moreover,
[DIMV14] presented an $O(4^{1/\delta})$-approximation algorithm that makes $O(4^{1/\delta})$ passes and uses
$\tilde{O}(mn^{\delta})$ memory space.

The Set Cover problem is not polynomially solvable even in the restricted instances with points
in $\mathbb{R}^2$ as elements, and geometric objects (either all disks or axis-parallel rectangles or fat triangles)
in the plane as sets [FG88, FPT81, HQ15]. As a result, there has been a large body of work on designing
approximation algorithms for the geometric Set Cover problems. See for example [MRR14, AP14,
AES10, CV07] and references therein.

Note that the simple greedy algorithm can be implemented by either storing the whole input (in one pass), or by
iteratively updating the set of yet-uncovered elements (in at most $n$ passes).

Figure 1.1: Summary of past work and our results. The last column indicates if the scheme is
randomized. $\rho$ denotes the approximation factor of an off-line algorithm solving Set Cover, which is
$\ln n$ for the greedy algorithm, and 1 for the exponential algorithm. Similarly, $\rho_g$ denotes the approximation factor
of an off-line algorithm solving geometric Set Cover. Finally, in the $s$-Sparse Set Cover problem,
$s$ (with $s \le n$) denotes an upper bound on the sizes of the input sets. Our lower bounds for Set Cover and
$s$-Sparse Set Cover hold for $m = O(n)$. Moreover, [ER14] and [CW16] proved that their algorithms
are tight. Here, and in the rest of the paper, all logarithms are in base two.

1.1 Our results 

Despite the progress outlined above, some basic questions still remained open. In particular:

(A) Is it possible to design a single pass streaming algorithm with a “low” approximation factor
that uses sublinear (i.e., $o(mn)$) space?

(B) If such single pass algorithms are not possible, what are the achievable trade-offs between
the number of passes and space usage?

(C) Are there special instances of the problem for which more efficient algorithms can be designed?

In this paper, we make significant progress on each of these questions. Our upper and lower
bounds are depicted in Figure 1.1.

Regarding question (A), note that the lower bound in [DIMV14] excluded this possibility only for deterministic algorithms, while the
upper bounds in [ER14, CW16] suffered from a polynomial approximation factor.

On the algorithmic side, we give an $O(1/\delta)$-pass algorithm with strongly sub-linear $\tilde{O}(mn^{\delta})$
space and logarithmic approximation factor. This yields a significant improvement over the earlier
algorithm of Demaine et al. [DIMV14], which used an exponentially larger number of passes. The trade-off
offered by our algorithm matches the lower bound of Nisan [Nis02] that holds at the endpoint of
the trade-off curve, i.e., for $\delta = \Theta(1/\log n)$, up to poly-logarithmic factors in space. Furthermore,
our algorithm is very simple and succinct, and therefore easy to implement and deploy.

Our algorithm exhibits a natural tradeoff between the number of passes and space, which resembles
tradeoffs achieved for other problems [GM07, GM08, GO13]. It is thus natural to conjecture
that this tradeoff might be tight, at least for “low enough” approximation factors. We present the
first step in this direction by showing a lower bound for the case when the approximation factor
is equal to 1, i.e., the goal is to compute the optimal set cover. In particular, by an information
theoretic lower bound, we show that any streaming algorithm that computes set cover using $(\frac{1}{2\delta} - 1)$
passes must use $\tilde{\Omega}(mn^{\delta})$ space (even assuming exponential computational power) in the regime of
$m = O(n)$. Furthermore, we show that a stronger lower bound holds if all the input sets are sparse,
that is, if their cardinality is at most $s$. We prove a lower bound of $\tilde{\Omega}(ms)$ for $s = O(n^{\delta})$ and
$m = O(n)$.

We also consider the problem in the geometric setting in which the elements are points in $\mathbb{R}^2$ and
sets are either discs, axis-parallel rectangles, or fat triangles in the plane. We show that a slightly
modified version of our algorithm achieves the optimal $\tilde{O}(n)$ space to find an $O(\rho)$-approximation
in $O(1)$ passes.

Finally, we show that any randomized one-pass algorithm that distinguishes between covers of
size 2 and 3 must use a linear (i.e., $\Omega(mn)$) amount of space. This is the first result showing that
a randomized, approximate algorithm cannot achieve a sub-linear space bound.

Recently Assadi et al. [AKL16] generalized this lower bound to any approximation ratio $\alpha =
O(\sqrt{n})$. More precisely, they showed that approximating Set Cover within any factor $\alpha = O(\sqrt{n})$
in a single pass requires $\Omega(mn/\alpha)$ space.

Our techniques: Basic idea. Our algorithm is based on the idea that whenever a large enough 
set is encountered, we can immediately add it to the cover. Specifically, we guess (up to factor 
two) the size of the optimal cover k. Thus, a set is “large” if it covers at least 1/k fraction of 
the remaining elements. A small set, on the other hand, can cover only a “few” elements, and we 
can store (approximately) what elements it covers by storing (in memory) an appropriate random 
sample. At the end of the pass, we have (in memory) the projections of “small” sets onto the 
random sample, and we compute the optimal set cover for this projected instance using an offline 
solver. By carefully choosing the size of the random sample, this guarantees that only a small 
fraction of the set system remains uncovered. The algorithm then makes an additional pass to find 
the residual set system (i.e., the yet uncovered elements), making two passes in each iteration, and 
continuing to the next iteration. 

Thus, one can think about the algorithm as being based on a simple iterative “dimensionality
reduction” approach. Specifically, in two passes over the data, the algorithm selects a “small” number
of sets that cover all but an $n^{-\delta}$ fraction of the uncovered elements, while using only $\tilde{O}(mn^{\delta})$ space. By
performing the reduction step $1/\delta$ times we obtain a complete cover. The dimensionality reduction
step is implemented by computing a small cover for a random subset of the elements, which also
covers the vast majority of the elements in the ground set. This ensures that the remaining sets,
when restricted to the random subset of the elements, occupy only $\tilde{O}(mn^{\delta})$ space. As a result the
procedure avoids a complex set of recursive calls as presented in Demaine et al. [DIMV14], which
leads to a simpler and more efficient algorithm.

Note that to achieve a logarithmic approximation ratio we can use an off-line algorithm with the approximation
ratio $\rho = 1$, i.e., one that runs in exponential time (see Theorem 2.8).

Geometric results. Further using techniques and results from computational geometry, we show
how to modify our algorithm so that it achieves almost optimal bounds for the Set Cover problem on
geometric instances. In particular, we show that it gives an $O(1)$-pass $O(\rho)$-approximation algorithm
using $\tilde{O}(n)$ space when the elements are points in $\mathbb{R}^2$ and the sets are either discs, axis-parallel
rectangles, or fat triangles in the plane. To this end, we use the following surprising property of
the set systems that arise out of points and disks: the number of sets is nearly linear as long
as one considers only sets that contain “a few” points.

More surprisingly, this property extends, with a twist, to certain geometric range spaces that 
might have a quadratic number of shallow ranges. Indeed, it is easy to show an example of $n$ points
in the plane, where there are $\Omega(n^2)$ distinct rectangles, each one containing exactly two points, see
Figure 1.2. However, one can “split” such ranges into a small number of canonical sets, such that
the number of shallow sets in the canonical set system is near linear. This enables us to store the 
small canonical sets encountered during the scan explicitly in memory, and still use only near linear 
space. 



Figure 1.2: Consider two parallel lines in the plane with positive slope. Place n/2 points on each 
line such that all the points on the top line lie above and to the left of all the points on the bottom 
line. Let the set of rectangles for this instance be all the rectangles which have a point on the top 
line as their upper left corner and a point on the bottom line as their lower right corner. Clearly, 
we have $n^2/4$ distinct rectangles (i.e., sets), each containing two points. As such, we cannot afford
to store explicitly in memory the set system, since it requires too much space. 
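
The construction in Figure 1.2 can be checked numerically. The sketch below is an illustration under assumed coordinates (two lines of slope 1, chosen here for convenience and not taken from the paper): it builds the point set, enumerates every rectangle with a top-line point as its upper-left corner and a bottom-line point as its lower-right corner, and collects the distinct point sets they induce.

```python
from itertools import product
from typing import FrozenSet, List, Set, Tuple

Point = Tuple[int, int]


def two_line_instance(n: int) -> Tuple[List[Point], Set[FrozenSet[Point]]]:
    """Build n points on two parallel lines and the ranges induced by the rectangles."""
    h = n // 2
    top = [(i, i + 2 * h) for i in range(h)]     # above and to the left of `bottom`
    bottom = [(h + j, j) for j in range(h)]
    points = top + bottom
    ranges: Set[FrozenSet[Point]] = set()
    for (tx, ty), (bx, by) in product(top, bottom):
        # Rectangle [tx, bx] x [by, ty]: upper-left corner on top, lower-right on bottom.
        inside = frozenset(p for p in points if tx <= p[0] <= bx and by <= p[1] <= ty)
        ranges.add(inside)
    return points, ranges
```

For these coordinates every rectangle contains exactly its two corner points, so the number of distinct ranges is $(n/2)^2 = n^2/4$.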

We note that the idea of splitting ranges into small canonical ranges is an old idea in orthogonal
range searching. It was used by Aronov et al. [AES10] for computing small $\varepsilon$-nets for these range
spaces. The idea, in the form we use, was further formalized by Ene et al. [EHR12].

Lower bounds. The lower bounds for multi-pass algorithms for the Set Cover problem are obtained
via a careful reduction from Intersection Set Chasing. The latter problem is a communication
complexity problem where $n$ players need to solve a certain “set-disjointness-like” problem in
$p$ rounds. A recent paper [GO13] showed that this problem requires $\frac{n^{1+\Omega(1/p)}}{p^{O(1)}}$ bits of communication
complexity for $p$ rounds. This yields our desired trade-off of $\tilde{\Omega}(mn^{\delta})$ space in $1/2\delta$ passes for
exact protocols for Set Cover in the communication model and hence in the streaming model for
$m = O(n)$. Furthermore, we show a stronger lower bound on memory space of sparse instances
of Set Cover in which all input sets have cardinality at most $s$. By a reduction from a variant of
Equal Pointer Chasing which maps the problem to a sparse instance of Set Cover, we show that
in order to have an exact streaming algorithm for $s$-Sparse Set Cover with $o(ms)$ space, $\Omega(\log n)$
passes are necessary. More precisely, any $(\frac{1}{2\delta} - 1)$-pass exact randomized algorithm for $s$-Sparse Set
Cover requires $\tilde{\Omega}(ms)$ memory space, if $s \le n^{\delta}$ and $m = O(n)$.

iterSetCover((U, F), δ):
    // Try in parallel all possible (2-approx) sizes of optimal cover
    for k ∈ {2^i | 0 ≤ i ≤ log n} do in parallel:    // n = |U|
        sol ← ∅
        Repeat for 1/δ times
            Let S be a sample of U of size cρkn^δ log m log n
            L ← S, F_S ← ∅
            for r ∈ F do    // By doing one pass
                if |L ∩ r| ≥ |S|/k then    // Size Test
                    sol ← sol ∪ {r}
                    L ← L \ r
                else
                    F_S ← F_S ∪ {r ∩ L}    // Store the set r ∩ L explicitly in memory
            D ← algOfflineSC(L, F_S, k), sol ← sol ∪ D
            U ← U \ ∪_{r ∈ sol} r    // By doing additional pass over data
    return best sol computed in all parallel executions.

Figure 1.3: A tight streaming algorithm for the (unweighted) Set Cover problem. Here, algOfflineSC
is an offline solver for Set Cover that provides a $\rho$-approximation, and $c$ is some appropriate
constant.

Our single pass lower bound proceeds by showing a lower bound for a one-way communication
complexity problem in which one party (Alice) has a collection of sets, and the other party (Bob)
needs to determine whether the complement of his set is covered by one of Alice’s sets. We
show that if Alice’s sets are chosen at random, then Bob can decode Alice’s input by employing a
small collection of “query” sets. This implies that the amount of communication needed to solve
the problem is linear in the description size of Alice’s sets, which is $\Omega(mn)$.

2 Streaming Algorithm for Set Cover 

2.1 Algorithm 

In this section, we design an efficient streaming algorithm for the Set Cover problem that matches
the lower bound results we already know about the problem. In the Set Cover problem, for a given
set system $(\mathcal{U}, \mathcal{F})$, the goal is to find a subset $X \subseteq \mathcal{F}$, such that $X$ covers $\mathcal{U}$ and its cardinality is
minimum. In the following, we sketch the iterSetCover algorithm (see also Figure 1.3).

In the iterSetCover algorithm, we have access to the algOfflineSC subroutine that solves the
given Set Cover instance offline (using linear space) and returns a $\rho$-approximate solution, where $\rho$
could be anywhere between 1 and $\Theta(\log n)$ depending on the computational model one assumes.
Under exponential computational power, we can achieve the optimal cover of the given instance of
Set Cover ($\rho = 1$); however, under the $\mathrm{P} \ne \mathrm{NP}$ assumption and given polynomial computational power,
$\rho$ cannot be better than $c \cdot \ln n$ where $c$ is a constant [Fei98, RS97, AMS06, Mos12, DS14].

Let $n = |\mathcal{U}|$ be the initial number of elements in the given ground set. The iterSetCover
algorithm needs to guess (up to a factor of two) the size of the optimal cover of $(\mathcal{U}, \mathcal{F})$. To this
end, the algorithm tries, in parallel, all values $k$ in $\{2^i \mid 0 \le i \le \log n\}$. This step will only increase
the memory space requirement by a factor of $\log n$.

Consider the run of the iterSetCover algorithm in which the guess $k$ is correct (i.e., $|\mathrm{OPT}| \le
k < 2|\mathrm{OPT}|$, where OPT is an optimal solution). The idea is to go through $O(1/\delta)$ iterations such
that each iteration only makes two passes and at the end of each iteration the number of uncovered
elements reduces by a factor of $n^{\delta}$. Moreover, the algorithm is allowed to use $\tilde{O}(mn^{\delta})$ space.

In each iteration, the algorithm starts with the current ground set of uncovered elements $\mathcal{U}$.
Let $S$ be a large enough uniform sample of elements of $\mathcal{U}$; the algorithm copies $S$ to a leftover set $L$. In a
single pass, using $S$, we estimate the size of all large sets in $\mathcal{F}$ and add each such $r \in \mathcal{F}$ to the solution sol
immediately (thus avoiding the need to store it in memory). Formally, if $r$ covers at least $\Omega(|\mathcal{U}|/k)$
yet-uncovered elements then it is a heavy set, and the algorithm immediately adds it to the
output cover. Otherwise, if a set is small, i.e., it covers fewer than $|\mathcal{U}|/k$ uncovered elements,
the algorithm stores the set $r$ in memory. Fortunately, it is enough to store its projection over the
sampled elements explicitly (i.e., $r \cap L$); this requires remembering only the $O(|S|/k)$ indices of the
elements of $r \cap L$.

In order to show that a solution of the Set Cover problem over the sampled elements is a
good cover of the initial Set Cover instance, we apply the relative $(p, \varepsilon)$-approximation sampling
result of [HS11] (see Definition 2.4), and it is enough for $S$ to be of size $\tilde{O}(\rho k n^{\delta})$. Using relative
$(p, \varepsilon)$-approximation sampling, we show that after two passes the number of uncovered elements is
reduced by a factor of $n^{\delta}$. Note that the relative $(p, \varepsilon)$-approximation sampling improves over the
Element Sampling technique used in [DIMV14] with respect to the number of passes.

Since in each iteration we pick $O(\rho k)$ sets and the number of uncovered elements decreases
by a factor of $n^{\delta}$, after $1/\delta$ iterations the algorithm picks $O(\rho k/\delta)$ sets and covers all elements.
Moreover, the memory space of the whole algorithm is $\tilde{O}(\rho m n^{\delta})$ (see Lemma 2.2).
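
The iteration described above can be simulated in a few lines of Python. This is a minimal sketch under stated assumptions, not the authors' implementation: the stream is a Python list scanned twice per iteration, the guesses for $k$ are tried sequentially rather than in parallel, a greedy routine (`greedy_offline`, a hypothetical stand-in) plays the role of algOfflineSC, and the constants `c` and `rho` are free parameters.

```python
import math
import random
from typing import Dict, List, Optional, Set


def greedy_offline(ground: Set[int], proj: Dict[int, Set[int]]) -> List[int]:
    """Stand-in for algOfflineSC: greedy cover of `ground` by the stored projections."""
    left, picked = set(ground), []
    while left:
        idx = max(proj, key=lambda i: len(proj[i] & left), default=None)
        if idx is None or not (proj[idx] & left):
            raise ValueError("the input sets do not cover the universe")
        picked.append(idx)
        left -= proj[idx]
    return picked


def run_guess(universe, sets, k, delta, c, rho, rng) -> Optional[List[int]]:
    """One run of the iteration scheme for a fixed guess k of the optimal cover size."""
    n, m = len(universe), len(sets)
    uncovered, sol = set(universe), []
    for _ in range(math.ceil(1 / delta)):
        if not uncovered:
            break
        size = math.ceil(c * rho * k * n ** delta * math.log2(m + 1) * math.log2(n + 1))
        sample = set(rng.sample(sorted(uncovered), min(size, len(uncovered))))
        left, proj, heavy = set(sample), {}, []
        for idx, r in enumerate(sets):            # first pass over the stream
            hit = left & r
            if len(hit) >= len(sample) / k:       # Size Test
                heavy.append(idx)
                left -= r
            elif hit:
                proj[idx] = hit                   # keep only the projection r ∩ L
        chosen = heavy + greedy_offline(left, proj)
        picked = set(chosen)
        for idx, r in enumerate(sets):            # second pass: update uncovered set
            if idx in picked:
                uncovered -= r
        sol.extend(chosen)
    return sol if not uncovered else None


def iter_set_cover(universe, sets, delta, c=1.0, rho=1.0, seed=0) -> Optional[List[int]]:
    """Try all guesses k = 2^i and return the smallest cover found (indices into `sets`)."""
    best = None
    for i in range(int(math.log2(len(universe))) + 1):
        sol = run_guess(universe, sets, 2 ** i, delta, c, rho, random.Random(seed))
        if sol is not None and (best is None or len(sol) < len(best)):
            best = sol
    return best
```

On small inputs the sample is the whole uncovered set, so the sketch degenerates to a size test followed by an offline solve; the space savings only appear when $\rho k n^{\delta} \log m \log n$ is much smaller than $n$.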

2.2 Analysis 

In the rest of this section we prove that the iterSetCover algorithm with high probability returns
an $O(\rho/\delta)$-approximate solution of Set Cover($\mathcal{U}, \mathcal{F}$) in $2/\delta$ passes using $\tilde{O}(mn^{\delta})$ memory space.

Lemma 2.1. The number of passes the iterSetCover algorithm makes is $2/\delta$.

Proof: In each of the $1/\delta$ iterations of the iterSetCover algorithm, the algorithm makes two
passes. In the first pass, based on the set of sampled elements $S$, it decides whether to pick a set
or keep its projection over $S$ (i.e., $r \cap L$) in memory. Then the algorithm calls algOfflineSC,
which does not require any passes over $\mathcal{F}$. The second pass is for computing the set of uncovered
elements at the end of the iteration. We need this pass because we only know the projection of the
sets we picked in the current iteration over $S$ and not over the original set of uncovered elements.
Thus, in total we make $2/\delta$ passes. Also note that for different guesses for the value of $k$, we run
the algorithm in parallel and hence the total number of passes remains $2/\delta$. ■

Lemma 2.2. The memory space used by the iterSetCover algorithm is $\tilde{O}(mn^{\delta})$.

Proof: In each iteration of the algorithm, it picks during the first pass at most $m$ sets (more
precisely at most $k$ sets), which requires $O(m \log m)$ memory. Moreover, in the first pass we keep
the projection of the sets whose projection over the uncovered sampled elements has size at most
$|S|/k$. Since there are at most $m$ such sets, the total required space for storing the projections is
bounded by $O(\rho m n^{\delta} \log m \log n)$.

Since in the second pass the algorithm only updates the set of uncovered elements, the amount
of space required in the second pass is $O(n)$. Thus, the total required space to perform each
iteration of the iterSetCover algorithm is $\tilde{O}(mn^{\delta})$. Moreover, note that the algorithm does not
need to keep the memory space used by the earlier iterations; thus, the total space consumed by
the algorithm is $\tilde{O}(mn^{\delta})$. ■


Next we show that the sets we picked before calling algOfflineSC have large size on $\mathcal{U}$.

Lemma 2.3. With probability at least $1 - m^{-c}$, all sets that pass the “Size Test” in the iterSetCover
algorithm have size at least $|\mathcal{U}|/(ck)$.

Proof: Let $r$ be a set of size less than $|\mathcal{U}|/(ck)$. In expectation, $|r \cap S|$ is less than $(|\mathcal{U}|/(ck)) \cdot (|S|/|\mathcal{U}|) =
\rho n^{\delta} \log m \log n$. By the Chernoff bound, for large enough $c$,

$$\Pr\left(|r \cap S| \ge c\rho n^{\delta} \log m \log n\right) \le m^{-(c+1)}.$$

Applying the union bound over the at most $m$ sets, with probability at least $1 - m^{-c}$ all sets passing the “Size Test” have size
at least $|\mathcal{U}|/(ck)$. ■


In what follows we define the relative $(p, \varepsilon)$-approximation sample of a set system and mention
the result of Har-Peled and Sharir [HS11] on the minimum required number of sampled elements
to get a relative $(p, \varepsilon)$-approximation of the given set system.


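Definition 2.4. Let $(V, \mathcal{H})$ be a set system, i.e., $V$ is a set of elements and $\mathcal{H} \subseteq 2^V$ is a family of
subsets of the ground set $V$. For given parameters $0 < \varepsilon, p < 1$, a subset $Z \subseteq V$ is a relative
$(p, \varepsilon)$-approximation for $(V, \mathcal{H})$ if for each $r \in \mathcal{H}$ we have that if $|r| \ge p|V|$ then

$$(1 - \varepsilon)\frac{|r|}{|V|} \le \frac{|r \cap Z|}{|Z|} \le (1 + \varepsilon)\frac{|r|}{|V|}.$$

If the range is light (i.e., $|r| < p|V|$) then it is required that

$$\frac{|r|}{|V|} - \varepsilon p \le \frac{|r \cap Z|}{|Z|} \le \frac{|r|}{|V|} + \varepsilon p.$$

Namely, $Z$ is a $(1 \pm \varepsilon)$-multiplicative good estimator for the size of ranges that are at least a $p$-fraction
of the ground set.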
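
Definition 2.4 translates directly into a checker. The following Python sketch (the function name and argument layout are assumptions made for illustration) tests both the heavy-range and the light-range condition for every range of a small, explicitly given set system.

```python
from typing import Iterable, Set


def is_relative_approximation(V: Set[int], ranges: Iterable[Set[int]],
                              Z: Set[int], p: float, eps: float) -> bool:
    """Return True iff Z is a relative (p, eps)-approximation for (V, ranges)."""
    for r in ranges:
        true_frac = len(r & V) / len(V)      # |r| / |V|
        est_frac = len(r & Z) / len(Z)       # |r ∩ Z| / |Z|
        if true_frac >= p:
            # Heavy range: multiplicative (1 ± eps) guarantee.
            if not (1 - eps) * true_frac <= est_frac <= (1 + eps) * true_frac:
                return False
        else:
            # Light range: additive eps * p guarantee.
            if not true_frac - eps * p <= est_frac <= true_frac + eps * p:
                return False
    return True
```

In the analysis, the property is applied to ranges that are missed entirely by the sample: a range $r$ with $|r \cap Z| = 0$ cannot be heavy, so the light-range condition gives $|r| \le \varepsilon p |V|$.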


The following lemma is a simplified variant of a result in Har-Peled and Sharir [HS11]; indeed,
a set system with $M$ sets can have VC dimension at most $\log M$. This simplified form also follows
by a somewhat careful but straightforward application of Chernoff’s inequality.


Lemma 2.5. Let $(\mathcal{U}, \mathcal{F})$ be a finite set system, and $p, \varepsilon, q$ be parameters. Then, a random sample $S$
of $\mathcal{U}$ such that $|S| \ge \frac{c'}{\varepsilon^2 p}\left(\log|\mathcal{F}| \log\frac{1}{p} + \log\frac{1}{q}\right)$, for an absolute constant $c'$, is a relative $(p, \varepsilon)$-approximation,
for all ranges in $\mathcal{F}$, with probability at least $(1 - q)$.




Lemma 2.6. Assuming $|\mathrm{OPT}| \le k < 2|\mathrm{OPT}|$, after any iteration, with probability at least $1 - $ […] the
number of uncovered elements decreases by a factor of $n^{\delta}$, and this iteration adds
$O(\rho|\mathrm{OPT}|)$ sets to the output cover.

Proof: Let $V \subseteq \mathcal{U}$ be the set of uncovered elements at the beginning of the iteration and note that the
total number of sets that are picked during the iteration is at most $(1+\rho)k$ (see Lemma 2.3). Consider
all possible such covers, that is $\mathcal{G} = \{\mathcal{F}' \subseteq \mathcal{F} \mid |\mathcal{F}'| \le (1+\rho)k\}$, and observe that $|\mathcal{G}| \le m^{(1+\rho)k}$. Let
$\mathcal{H}$ be the collection that contains all possible sets of uncovered elements at the end of the iteration,
defined as $\mathcal{H} = \{V \setminus \bigcup_{r \in \mathcal{C}} r \mid \mathcal{C} \in \mathcal{G}\}$. Moreover, set $p = 2/n^{\delta}$, $\varepsilon = 1/2$ and $q = m^{-c}$ and note that
$|\mathcal{H}| \le |\mathcal{G}| \le m^{(1+\rho)k}$. Since $\frac{c'}{\varepsilon^2 p}\left(\log|\mathcal{H}| \log\frac{1}{p} + \log\frac{1}{q}\right) \le c\rho k n^{\delta} \log m \log n = |S|$ for large enough $c$,
by Lemma 2.5, $S$ is a relative $(p, \varepsilon)$-approximation of $(V, \mathcal{H})$ with probability $(1 - q)$. Let $\mathcal{D} \subseteq \mathcal{F}$ be
the collection of sets picked during the iteration, which covers all elements in $S$. Since $S$ is a relative
$(p, \varepsilon)$-approximation sample of $(V, \mathcal{H})$, with probability at least $1 - q$ the number of
elements of $V$ (or $\mathcal{U}$) left uncovered by $\mathcal{D}$ is at most $\varepsilon p|V| = |\mathcal{U}|/n^{\delta}$.

Hence, in each iteration we pick $O(\rho k)$ sets and at the end of the iteration the number of uncovered
elements reduces by a factor of $n^{\delta}$. ■

Lemma 2.7. The iterSetCover algorithm computes a set cover of $(\mathcal{U}, \mathcal{F})$ whose size is within an
$O(\rho/\delta)$ factor of the size of an optimal cover with probability at least $1 - $ […].

Proof: Consider the run of iterSetCover for which the value of $k$ is between $|\mathrm{OPT}|$ and $2|\mathrm{OPT}|$.
In each of the $1/\delta$ iterations made by the algorithm, by Lemma 2.6, the number of uncovered
elements decreases by a factor of $n^{\delta}$, where $n$ is the number of initial elements to be covered by the
sets. Moreover, the number of sets picked in each iteration is $O(\rho k)$. Thus after $1/\delta$ iterations, all
elements would be covered and the total number of sets in the solution is $O(\rho|\mathrm{OPT}|/\delta)$. Moreover,
by Lemma 2.6 and the union bound over the $1/\delta$ iterations, all the iterations succeed with probability at least $1 - $ […]. ■

Theorem 2.8. The iterSetCover($\mathcal{U}, \mathcal{F}, \delta$) algorithm makes $2/\delta$ passes, uses $\tilde{O}(mn^{\delta})$ memory
space, and finds an $O(\rho/\delta)$-approximate solution of the Set Cover problem with high probability.

Furthermore, given enough passes the iterSetCover algorithm matches the known
lower bound on the memory space of the streaming Set Cover problem up to a polylog($m$) factor,
where $m$ is the number of sets in the input.

Proof: The first part follows from Lemma 2.1, Lemma 2.2, and Lemma 2.7.

As for the lower bound, note that by a result of Nisan [Nis02], any randomized $(\frac{\log n}{2})$-approximation
protocol for Set Cover($\mathcal{U}, \mathcal{F}$) in the one-way communication model requires $\Omega(m)$ bits of communication,
no matter how many rounds it makes. This implies that any randomized
$O(\log n)$-pass, $(\frac{\log n}{2})$-approximation algorithm for Set Cover($\mathcal{U}, \mathcal{F}$) requires $\tilde{\Omega}(m)$ space, even under
the exponential computational power assumption.

By the above, the iterSetCover algorithm makes $O(1/\delta)$ passes and uses $\tilde{O}(mn^{\delta})$ space to
return an $O(\frac{1}{\delta})$-approximate solution under the exponential computational power assumption ($\rho =
1$). Thus by letting $\delta = c/\log n$, we will have a $(\frac{\log n}{2})$-approximation streaming algorithm using
$\tilde{O}(m)$ space, which is optimal up to a factor of polylog($m$). ■
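
The substitution $\delta = c/\log n$ used at the end of the proof can be spelled out as follows; this is a routine calculation added for convenience, with logarithms in base two as elsewhere in the paper.

```latex
\[
n^{\delta} = n^{c/\log n} = 2^{(\log n)\cdot c/\log n} = 2^{c},
\qquad
\frac{2}{\delta} = \frac{2\log n}{c} = O(\log n),
\qquad
\tilde{O}\!\left(mn^{\delta}\right) = \tilde{O}\!\left(2^{c}\,m\right) = \tilde{O}(m).
\]
```

So at this endpoint of the trade-off the algorithm uses $O(\log n)$ passes, and the $n^{\delta}$ factor in the space bound collapses to a constant.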

Theorem 2.8 provides a strong indication that our trade-off algorithm is optimal. 
3 Lower Bound for Single Pass Algorithms


In this section, we study the Set Cover problem in the two-party communication model and give
a tight lower bound on the communication complexity of the randomized protocols solving the
problem in a single round. In the two-party Set Cover, we are given a set of elements $\mathcal{U}$ and there
are two players, Alice and Bob, each of whom has a collection of subsets of $\mathcal{U}$, $\mathcal{F}_A$ and $\mathcal{F}_B$ respectively.
The goal for them is to find a minimum size cover $\mathcal{C} \subseteq \mathcal{F}_A \cup \mathcal{F}_B$ covering $\mathcal{U}$ while communicating
the fewest number of bits from Alice to Bob (in this model Alice communicates to Bob and then
Bob should report a solution).

Our main lower bound result for the single pass protocols for Set Cover is the following theorem,
which implies that the naive approach in which one party sends all of its sets to the other one
is optimal.

Theorem 3.1. Any single round randomized protocol that approximates Set Cover($\mathcal{U}, \mathcal{F}$) within a
factor better than $3/2$ and error probability $O(m^{-c})$ requires $\Omega(mn)$ bits of communication, where
$n = |\mathcal{U}|$ and $m = |\mathcal{F}|$ and $c$ is a sufficiently large constant.

We consider the case in which the parties want to decide whether there exists a cover of size 2
for $\mathcal{U}$ in $\mathcal{F}_A \cup \mathcal{F}_B$ or not. If any of the parties has a cover of size at most 2 for $\mathcal{U}$, then it becomes
trivial. Thus the question is whether there exist $r_a \in \mathcal{F}_A$ and $r_b \in \mathcal{F}_B$ such that $\mathcal{U} \subseteq r_a \cup r_b$.

A key observation is that to decide whether there exist $r_a \in \mathcal{F}_A$ and $r_b \in \mathcal{F}_B$ such that $\mathcal{U} \subseteq r_a \cup r_b$,
one can instead check whether there exist $r_a \in \mathcal{F}_A$ and $r_b \in \mathcal{F}_B$ such that $\overline{r_a} \cap \overline{r_b} = \emptyset$. In other
words we need to solve the OR of a series of two-party Set Disjointness problems. In the two-party Set
Disjointness problem, Alice and Bob are given subsets of $\mathcal{U}$, $r_a$ and $r_b$, and the goal is to decide
whether $r_a \cap r_b$ is empty or not with the fewest possible bits of communication. Set Disjointness is a
well-studied problem in communication complexity and it has been shown that any randomized
protocol for Set Disjointness with $O(1)$ error probability requires $\Omega(n)$ bits of communication where
$n = |\mathcal{U}|$ [BJKS04, KS92, Raz92].
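
The key observation is an instance of De Morgan's law: $r_a \cup r_b$ covers the ground set exactly when the complements of $r_a$ and $r_b$ are disjoint. The small Python sketch below (function names are illustrative assumptions) evaluates both formulations by brute force so they can be compared on concrete inputs.

```python
from itertools import product
from typing import List, Set


def has_cover_of_size_two(U: Set[int], FA: List[Set[int]], FB: List[Set[int]]) -> bool:
    """Is there r_a in FA and r_b in FB with U ⊆ r_a ∪ r_b?"""
    return any(U <= (ra | rb) for ra, rb in product(FA, FB))


def has_disjoint_complements(U: Set[int], FA: List[Set[int]], FB: List[Set[int]]) -> bool:
    """Is there r_a in FA and r_b in FB whose complements (within U) are disjoint?"""
    return any(not ((U - ra) & (U - rb)) for ra, rb in product(FA, FB))
```

Both functions return the same answer on every input, which is why a lower bound for the disjointness formulation carries over to deciding whether a cover of size 2 exists.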

We can think of the following extensions of the Set Disjointness problem. 

Many vs One: In this variant, Alice has $m$ subsets of $\mathcal{U}$, $\mathcal{F}_A$, and Bob is given a single set $r_b$. The
goal is to determine whether there exists a set $r_a \in \mathcal{F}_A$ such that $r_a \cap r_b = \emptyset$.

Many vs Many: In this variant, each of Alice and Bob is given a collection of subsets of $\mathcal{U}$ and the
goal for them is to determine whether there exist $r_a \in \mathcal{F}_A$ and $r_b \in \mathcal{F}_B$ such that $r_a \cap r_b = \emptyset$.

Note that deciding whether two-party Set Cover has a cover of size 2 is equivalent to solving the
(Many vs Many)-Set Disjointness problem. Moreover, any lower bound for (Many vs One)-Set Disjointness
clearly implies the same lower bound for the (Many vs Many)-Set Disjointness problem.
In the following theorem we show that any single-round randomized protocol that solves (Many vs
One)-Set Disjointness($m, n$) with error probability $O(m^{-c})$ requires $\Omega(mn)$ bits of communication.


Theorem 3.2. Any randomized protocol for (Many vs One)-Set Disjointness($m, n$) with error
probability that is $O(m^{-c})$ requires $\Omega(mn)$ bits of communication if $n \ge c_1 \log m$, where $c$ and
$c_1$ are large enough constants.

The idea is to show that if there exists a single-round randomized protocol for the problem with
$o(mn)$ bits of communication and error probability $O(m^{-c})$, then with constant probability one
can distinguish $\Omega(2^{mn})$ distinct inputs using $o(mn)$ bits, which is a contradiction.
algRecoverBit(U, s):
    F_A ← ∅
    for i = 1 to […] do
        Let r_b be a random subset of U of size c_1 log m
        if algExistsDisj(s, r_b) = true
            // Discovering the set (or union of sets)
            // in F_A disjoint from r_b
            r ← ∅
            for e ∈ U \ r_b
                if algExistsDisj(s, r_b ∪ {e}) = false
                    r ← r ∪ {e}
            if ∃r' ∈ F_A s.t. r ⊂ r'    // Pruning step
                F_A ← F_A \ {r'}, F_A ← F_A ∪ {r}
            else if ∄r' ∈ F_A s.t. r' ⊂ r
                F_A ← F_A ∪ {r}
    return F_A

Figure 3.1: algRecoverBit uses a protocol for (Many vs One)-Set Disjointness($m, n$) to recover
Alice’s sets, $\mathcal{F}_A$, on Bob’s side.


Suppose that Alice has a collection $\mathcal{F}_A$ of $m$ uniformly and independently random subsets of $U$ (each
element $e \in U$ belongs to each of her subsets with probability 1/2). Let us assume that there
exists a single-round protocol $I$ for (Many vs One)-Set Disjointness$(m, n)$ with error probability $O(m^{-c})$
using $o(mn)$ bits of communication. Let algExistsDisj be Bob's algorithm in protocol $I$.
We show that one can then recover $mn$ random bits with constant probability using the algExistsDisj
subroutine and the message $s$ sent by the first party in protocol $I$. The algorithm algRecoverBit,
shown in Figure 3.1, recovers the random bits using protocol $I$ and algExistsDisj.

To this end, Bob gets the message $s$ communicated by protocol $I$ from Alice and considers
subsets of $U$ of size $c_1 \log m$ and $c_1 \log m + 1$. Note that $s$ is communicated only once, and thus the
same $s$ is used for all queries that Bob makes. At each step Bob picks a random subset $r_b$
of $U$ of size $c_1 \log m$ and solves the (Many vs One)-Set Disjointness problem with input $(\mathcal{F}_A, r_b)$ by
running algExistsDisj$(s, r_b)$. Next we show that if $r_b$ is disjoint from a set in $\mathcal{F}_A$, then with high
probability there is exactly one set in $\mathcal{F}_A$ that is disjoint from $r_b$ (see Lemma 3.3). Thus once Bob
finds out that his query $r_b$ is disjoint from a set in $\mathcal{F}_A$, he can query all sets $r \in \{r_b \cup e \mid e \in U \setminus r_b\}$
and recover the set (or union of sets) in $\mathcal{F}_A$ that is disjoint from $r_b$. By a simple pruning step we
can detect the recovered sets that are unions of more than one set in $\mathcal{F}_A$ and only keep the sets in $\mathcal{F}_A$.
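
The recovery step for a single query can be sketched as follows. This is a minimal illustration, not the protocol itself: `make_exists_disj` is a hypothetical stand-in that answers queries exactly from Alice's sets, whereas in the argument above Bob only has the message $s$ and a protocol that may err. The sketch covers only the case in which the query is disjoint from exactly one of Alice's sets.

```python
def make_exists_disj(family_a):
    """Stand-in for algExistsDisj: an exact oracle built from Alice's sets.

    In the protocol the answer would be computed from the message s alone.
    """
    def exists_disj(query):
        return any(r.isdisjoint(query) for r in family_a)
    return exists_disj


def recover_from_query(exists_disj, universe, r_b):
    """Rebuild the unique set r of Alice that is disjoint from r_b.

    Uses one extra query per element of the universe outside r_b.
    """
    recovered = set()
    for e in universe - r_b:
        # Adding e to the query destroys disjointness exactly when e is in r.
        if not exists_disj(r_b | {e}):
            recovered.add(e)
    return frozenset(recovered)
```

With $|r_b| = c_1 \log m$ this uses $n - c_1 \log m$ follow-up queries of size $c_1 \log m + 1$, matching the count used later in the proof of Lemma 3.6.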

In Lemma 3.6, we show that the number of queries that Bob is required to make to recover $\mathcal{F}_A$
is $O(m^c)$ where $c$ is a constant.

Lemma 3.3. Let $r_b$ be a random subset of $U$ of size $c \log m$ and let $\mathcal{F}_A$ be a collection of $m$ random
subsets of $U$. The probability that there exists exactly one set in $\mathcal{F}_A$ that is disjoint from $r_b$ is at
least $\frac{1}{m^{c+1}}$.
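Proof: The probability that $r_b$ is disjoint from exactly one set in $\mathcal{F}_A$ is

$$\Pr(r_b \text{ is disjoint from} \ge 1 \text{ set in } \mathcal{F}_A) - \Pr(r_b \text{ is disjoint from} \ge 2 \text{ sets in } \mathcal{F}_A) \ge \left(\frac{1}{2}\right)^{c \log m} - \frac{1}{m^{2c-2}} \ge \frac{1}{m^{c+1}}.$$

First we prove the first term in the above inequality. For an arbitrary set $r \in \mathcal{F}_A$, since any element
is contained in $r$ with probability $1/2$, the probability that $r$ is disjoint from $r_b$ is $(1/2)^{c \log m}$. Hence

$$\Pr(r_b \text{ is disjoint from at least one set in } \mathcal{F}_A) \ge 2^{-c \log m} = \frac{1}{m^{c}}.$$

Moreover, since there exist $\binom{m}{2}$ pairs of sets in $\mathcal{F}_A$, and for each $r_1, r_2 \in \mathcal{F}_A$ the probability that $r_1$
and $r_2$ are both disjoint from $r_b$ is $(1/2)^{2c \log m}$,

$$\Pr(r_b \text{ is disjoint from at least two sets in } \mathcal{F}_A) \le \binom{m}{2} \frac{1}{m^{2c}} \le \frac{1}{m^{2c-2}}.$$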
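
The bound of Lemma 3.3 can be checked numerically. The sketch below is an illustration under assumptions: logarithms are base 2, memberships are independent with probability 1/2, and the function names are hypothetical. Under these assumptions each of the $m$ sets is disjoint from a fixed query of size $q$ independently with probability $2^{-q}$, so the probability that exactly one is disjoint has a closed form.

```python
def prob_exactly_one_disjoint(m, query_size):
    """Exact probability that exactly one of m random sets avoids the query."""
    x = 0.5 ** query_size          # one fixed set avoids the whole query
    return m * x * (1.0 - x) ** (m - 1)


def lemma_bound(m, c):
    """The lower bound 1 / m^(c+1) claimed in Lemma 3.3."""
    return 1.0 / m ** (c + 1)
```

For example, with $m = 16$ and $c = 3$ the query has size $c \log_2 m = 12$, and the exact probability (about $3.9 \times 10^{-3}$) is well above $1/m^{4} \approx 1.5 \times 10^{-5}$.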


A family of sets $\mathcal{M}$ is called intersecting if and only if for any sets $A, B \in \mathcal{M}$ either both
$A \setminus B$ and $B \setminus A$ are non-empty or both $A \setminus B$ and $B \setminus A$ are empty; in other words, there exist
no $A, B \in \mathcal{M}$ such that $A \subsetneq B$. Let $\mathcal{F}_A$ be a collection of subsets of $U$. We show that with
high probability, after testing $O(m^c)$ queries for a sufficiently large constant $c$, the algRecoverBit
algorithm recovers $\mathcal{F}_A$ completely if $\mathcal{F}_A$ is intersecting. First we show that with high probability
the collection $\mathcal{F}_A$ is intersecting.

Observation 3.4. Let $\mathcal{F}_A$ be a collection of $m$ uniformly random subsets of $U$ where $|U| \ge c \log m$.
With probability at least $1 - m^2 (3/4)^{|U|}$, $\mathcal{F}_A$ is an intersecting family.

Proof: The probability that $r_1 \subseteq r_2$ is $(3/4)^n$ and there are at most $m(m-1)$ ordered pairs of sets in $\mathcal{F}_A$.
Thus with probability at least $1 - m^2 (3/4)^n \ge 1 - m^{2 - c \log_2(4/3)}$, $\mathcal{F}_A$ is intersecting. ■
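
Both ingredients of this observation are easy to check on small instances. The following sketch is illustrative only (the function names are hypothetical): it tests the intersecting property as defined above and computes, by exhaustive enumeration, the probability that one uniform random subset is contained in another.

```python
from itertools import combinations, product


def is_intersecting(family):
    """True if no set of the family is a proper subset of another one."""
    return not any(a < b or b < a for a, b in combinations(family, 2))


def containment_probability(n):
    """Exact Pr[r1 is a subset of r2] for two uniform subsets of an
    n-element universe, by enumeration (small n only)."""
    subsets = [
        frozenset(i for i in range(n) if (mask >> i) & 1)
        for mask in range(2 ** n)
    ]
    hits = sum(1 for a, b in product(subsets, repeat=2) if a <= b)
    return hits / len(subsets) ** 2
```

The enumeration reproduces the value $(3/4)^n$: each element independently avoids the single forbidden case "in $r_1$ but not in $r_2$".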

Observation 3.5. The number of distinct inputs of Alice (collections of random subsets of $U$)
that are distinguishable by algRecoverBit is $2^{\Omega(mn)}$.

Proof: There are $2^{mn}$ collections of $m$ random subsets of $U$. By Observation 3.4, $\Omega(2^{mn})$ of them
are intersecting. Since we can only recover the sets in the input collection and not their order, the
number of distinct input collections that are distinguished by algRecoverBit is $\Omega(2^{mn}/m!)$, which is
$2^{\Omega(mn)}$ for $n \ge c \log m$. ■
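
The effect of forgetting the order of the sets can be quantified directly. The sketch below is a hedged illustration (the function name is hypothetical, and it assumes that the loss from ordering is the factor $m!$): it evaluates $\log_2(2^{mn}/m!) = mn - \log_2(m!)$, which stays within a constant factor of $mn$ once $n \ge c \log_2 m$, since $\log_2(m!) \le m \log_2 m$.

```python
import math


def log2_distinguishable(m, n):
    """log2 of 2^(mn) / m!, computed as mn - log2(m!)."""
    return m * n - math.lgamma(m + 1) / math.log(2)
```

For $m = 1024$ and $n = 100$ (so $n = 10 \log_2 m$), more than 90% of the $mn$ bits survive.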


By Observation 3.4, and only considering the case in which $\mathcal{F}_A$ is intersecting, we have the following
lemma.

Lemma 3.6. Let $\mathcal{F}_A$ be a collection of $m$ uniformly random subsets of $U$ and suppose that $|U| \ge c \log m$.
After testing at most $m^c$ queries, with probability at least $(1 - \frac{1}{m}) p^{m^c}$, $\mathcal{F}_A$ is fully recovered,
where $p$ is the success rate of protocol $I$ for the (Many vs One)-Set Disjointness problem.

Proof: By Lemma 3.3, for each $r_b \subseteq U$ of size $c_1 \log m$, the probability that $r_b$ is disjoint from
exactly one set in a random collection of sets $\mathcal{F}_A$ is at least $1/m^{c_1+1}$. Given that $r_b$ is disjoint from
exactly one set in $\mathcal{F}_A$, due to the symmetry of the problem, the chance that $r_b$ is disjoint from a specific
set $r \in \mathcal{F}_A$ is at least $1/m^{c_1+2}$. After $\alpha m^{c_1+2} \log m$ queries, where $\alpha$ is a large enough constant, for
any $r \in \mathcal{F}_A$ the probability that there is no query that is disjoint only from $r$ is at most

$$\left(1 - \frac{1}{m^{c_1+2}}\right)^{\alpha m^{c_1+2} \log m} \le e^{-\alpha \log m} \le \frac{1}{m^{\alpha}}.$$

Thus after trying $\alpha m^{c_1+2} \log m$ queries, with probability at least $(1 - \frac{1}{m^{\alpha-1}}) \ge (1 - \frac{1}{m})$, for each
$r \in \mathcal{F}_A$ we have at least one query that is disjoint only from $r$ (and not from any other set in $\mathcal{F}_A \setminus \{r\}$).

Once we have a query subset $r_b$ which is disjoint only from a single set $r \in \mathcal{F}_A$, we can ask
$n - c_1 \log m$ queries of size $c_1 \log m + 1$ and recover $r$. Note that if $r_b$ is disjoint from more than
one set in $\mathcal{F}_A$ simultaneously, the process (asking $n - c_1 \log m$ queries of size $c_1 \log m + 1$) will end
up recovering the union of those sets. Since $\mathcal{F}_A$ is an intersecting family with high probability
(Observation 3.4), the pruning step in the algRecoverBit algorithm guarantees that at the
end of the algorithm, what we return is exactly $\mathcal{F}_A$. Moreover, the total number of queries the
algorithm makes is at most

$$n \times (\alpha m^{c_1+2} \log m) \le \alpha m^{c_1+3} \log m \le m^{c}$$

for $c \ge c_1 + 4$.

Thus after testing $m^c$ queries, $\mathcal{F}_A$ will be recovered with probability at least $(1 - \frac{1}{m}) p^{m^c}$, where
$p$ is the success probability of the protocol $I$ for (Many vs One)-Set Disjointness$(m, n)$. ■

Corollary 3.7. Let $I$ be a protocol for (Many vs One)-Set Disjointness$(m, n)$ with error probability
$O(m^{-c})$ and $s$ bits of communication such that $n \ge c \log m$ for large enough $c$. Then algRecoverBit
recovers $\mathcal{F}_A$ with constant success probability using $s$ bits of communication.

By Observation 3.5, since algRecoverBit distinguishes $2^{\Omega(mn)}$ distinct inputs with constant
probability of success (by Corollary 3.7), the size of the message sent by Alice should be $\Omega(mn)$. This
proves Theorem 3.2.

Proof of Theorem 3.1: As we showed earlier, the communication complexity of (Many vs One)-Set
Disjointness is a lower bound for the communication complexity of Set Cover. Theorem 3.2
showed that any protocol for (Many vs One)-Set Disjointness$(m, n)$, where $m = |\mathcal{F}_A|$, with error probability
$O(m^{-c})$ requires $\Omega(mn)$ bits of communication. Thus any single-round randomized protocol
for Set Cover with error probability $O(m^{-c})$ requires $\Omega(mn)$ bits of communication. ■

Since any $p$-pass streaming $\alpha$-approximation algorithm for a problem P that uses $O(s)$ memory
space is a $p$-round two-party $\alpha$-approximation protocol for problem P using $O(sp)$ bits of communication
[GM08], Theorem 3.1 implies the following lower bound for the Set Cover problem in
the streaming model.

Theorem 3.8. Any single-pass randomized streaming algorithm for Set Cover$(U, \mathcal{F})$ that computes
a $(3/2)$-approximate solution with probability $\Omega(1 - m^{-c})$ requires $\Omega(mn)$ memory space (assuming
$n \ge c_1 \log m$).

4 Geometric Set Cover 

In this section, we consider the streaming Set Cover problem in the geometric setting. We present
an algorithm for the case where the elements are a set of $n$ points in the plane and the $m$ sets
are either all disks, all axis-parallel rectangles, or all $\alpha$-fat triangles (which for simplicity we call
shapes) given in a data stream. As before, the goal is to find a minimum-size cover of the points
from the given sets. We call this problem the Points-Shapes Set Cover problem.

Note that the description of each shape requires $O(1)$ space, and thus the Points-Shapes Set
Cover problem can trivially be solved in $O(m + n)$ space. In this setting the goal is to design an
algorithm whose space is sub-linear in $O(m + n)$. Here we show that almost the same algorithm as
iterSetCover (with slight modifications) uses $\tilde{O}(n)$ space to find an $O(\rho)$-approximate solution of
the Points-Shapes Set Cover problem in a constant number of passes.

4.1 Preliminaries 

A triangle $\triangle$ is called $\alpha$-fat (or simply fat) if the ratio between its longest edge and its height on
this edge is bounded by a constant $\alpha > 1$ (there are several equivalent definitions of $\alpha$-fat triangles).


Definition 4.1. Let $(U, \mathcal{F})$ be a set system such that $U$ is a set of points and $\mathcal{F}$ is a collection of
shapes, in the plane $\mathbb{R}^2$. The canonical representation of $(U, \mathcal{F})$ is a collection $\mathcal{F}'$ of regions such
that the following conditions hold. First, each $r' \in \mathcal{F}'$ has $O(1)$ description. Second, for each
$r' \in \mathcal{F}'$, there exists $r \in \mathcal{F}$ such that $r' \cap U \subseteq r \cap U$. Finally, for each $r \in \mathcal{F}$, there exist $c_1$ sets
$r'_1, \dots, r'_{c_1} \in \mathcal{F}'$ such that $r \cap U = (r'_1 \cup \dots \cup r'_{c_1}) \cap U$ for some constant $c_1$.

The following two results are from [EHR12], which formalize the ideas in [AES10].


Lemma 4.2. (Lemma 4.18 in [EHR12]) Given a set of points $U$ in the plane and a parameter $w$,
one can compute a set $\mathcal{F}'_{total}$ of $O(|U| w^2 \log |U|)$ axis-parallel rectangles with the following property.
For an arbitrary axis-parallel rectangle $r$ that contains at most $w$ points of $U$, there exist two
axis-parallel rectangles $r'_1, r'_2 \in \mathcal{F}'_{total}$ whose union has the same intersection with $U$ as $r$, i.e.,
$r \cap U = (r'_1 \cup r'_2) \cap U$.

Lemma 4.3. (Theorem 5.6 in [EHR12]) Given a set of points $U$ in $\mathbb{R}^2$, a parameter $w$ and a
constant $\alpha$, one can compute a set $\mathcal{F}'_{total}$ of $O(|U| w^3 \log^2 |U|)$ regions, each having $O(1)$ description,
with the following property. For an arbitrary $\alpha$-fat triangle $r$ that contains at most $w$ points of $U$,
there exist nine regions from $\mathcal{F}'_{total}$ whose union has the same intersection with $U$ as $r$.

Using the above lemmas we get the following lemma. 

Lemma 4.4. Let $U$ be a set of points in $\mathbb{R}^2$ and let $\mathcal{F}$ be a set of shapes (discs, axis-parallel rectangles
or fat triangles), such that each set in $\mathcal{F}$ contains at most $w$ points of $U$. Then, in a single pass
over the stream of sets $\mathcal{F}$, one can compute the canonical representation $\mathcal{F}'$ of $(U, \mathcal{F})$. Moreover,
the size of the canonical representation is at most $O(|U| w^3 \log^2 |U|)$ and the space requirement of
the algorithm is $\tilde{O}(|\mathcal{F}'|) = \tilde{O}(|U| w^3)$.

Proof: For the case of axis-parallel rectangles and fat triangles, we first use Lemma 4.2 and
Lemma 4.3 to get the set $\mathcal{F}'_{total}$ offline, which requires $O(|\mathcal{F}'_{total}|) = O(|U| w^3 \log^2 |U|)$ memory space.
Then, by making one pass over the stream of sets $\mathcal{F}$, we can find the canonical representation $\mathcal{F}'$
by picking all the sets $S' \in \mathcal{F}'_{total}$ such that $S' \cap U \subseteq S \cap U$ for some $S \in \mathcal{F}$. For discs, however, we
just make one pass over the sets $\mathcal{F}$ and keep a maximal subset $\mathcal{F}' \subseteq \mathcal{F}$ such that for each pair of
sets $S'_1, S'_2 \in \mathcal{F}'$ their projections on $U$ are different, i.e., $S'_1 \cap U \ne S'_2 \cap U$. By a standard technique
of Clarkson and Shor [CS89], it can be proved that the size of the canonical representation, i.e.,
$|\mathcal{F}'|$, is bounded by $O(|U| w^2)$. Note that this is just counting the number of discs that contain at
most $w$ points, namely the at most $w$-level discs. ■
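
For discs, the single pass described in the proof can be sketched as follows. This is an illustrative sketch only: discs are assumed to be closed and given as hypothetical `(cx, cy, radius)` triples, and each kept disc is indexed by its full projection on $U$, which is simpler but stores more per disc than the $O(1)$ description the lemma relies on.

```python
def disc_projection(disc, points):
    """Indices of the points of U that lie inside the closed disc."""
    cx, cy, radius = disc
    return frozenset(
        i for i, (x, y) in enumerate(points)
        if (x - cx) ** 2 + (y - cy) ** 2 <= radius ** 2
    )


def distinct_projections(disc_stream, points):
    """One pass over the discs, keeping one disc per distinct projection."""
    kept = {}
    for disc in disc_stream:
        proj = disc_projection(disc, points)
        kept.setdefault(proj, disc)  # the first disc seen for a projection wins
    return list(kept.values())
```

The returned family is maximal in the sense of the proof: every streamed disc has the same projection on $U$ as some kept disc.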

4.2 Algorithm 

The outline of the algorithm for Points-Shapes Set Cover, algGeomSC (shown in Figure 4.1), is very similar to the
iterSetCover algorithm presented earlier in Section 2.

In the first pass, the algorithm picks all the sets that cover a large number of yet-uncovered
elements. Next, we sample a set $S$ of the remaining uncovered elements. Since the first pass has removed
all the ranges that have large size, the size of the remaining ranges restricted to the sample $S$ is small. Therefore, by Lemma 4.4,
the canonical representation of $(S, \mathcal{F}_S)$ has small size and we can afford to store it in memory.
We use Lemma 4.4 to compute the canonical representation $\mathcal{F}_S$ in one pass. The algorithm then
uses the sets in $\mathcal{F}_S$ to find a cover $sol_S$ for the points of $S$. Next, in one additional pass, the
algorithm replaces each set in $sol_S$ by one of its supersets in $\mathcal{F}$.

Finally, note that in the algorithm of Section 2, we are assuming that the size of the optimal
solution is $O(k)$. Thus it is enough to stop the iterations once the number of uncovered elements is
less than $k$. Then we can pick an arbitrary set for each of the uncovered elements. This adds
only $k$ more sets to the solution. Using this idea, we can reduce the size of the sampled elements
down to $c\rho k (n/k)^{\delta} \log m \log n$, which helps us in getting near-linear space in the geometric
setting. Note that the final pass of the algorithm can be embedded into the previous passes, but for
the sake of clarity we write it separately.


4.3 Analysis 


By an approach similar to the one used in Section 2 to analyze the pass count and approximation
guarantee of the iterSetCover algorithm, we can show that the number of passes of the algGeomSC
algorithm is $3/\delta + 1$ (which can be reduced to $3/\delta$ with minor changes), and that the algorithm returns
an $O(\rho/\delta)$-approximate solution. Next, we analyze the space usage and the correctness of the
algorithm. Note that our analysis in this section only works for $\delta \le 1/4$.

Lemma 4.5. The algorithm uses $\tilde{O}(n)$ space.

Proof: Consider an iteration of the algorithm. The memory space used in the first pass of each
iteration is $\tilde{O}(n)$. The size of $S$ is $c\rho k (n/k)^{\delta} \log m \log n$, and after the first pass the size of each set
is at most $|U|/k$. Thus, by a Chernoff bound, for each set $r \in \mathcal{F} \setminus sol$ the probability of the event

$$|r \cap S| > (1 + 2)\frac{|S|}{k}$$

is polynomially small in $m$. Thus, with high probability (by the union bound), all the sets that are not picked in
the first pass cover at most $3|S|/k = 3c\rho (n/k)^{\delta} \log m \log n$ elements of $S$. Therefore, we can use
Lemma 4.4 with $w = 3|S|/k$ to show that the number of sets in the canonical representation of $(S, \mathcal{F}_S)$ is at most

$$O(|S| w^3 \log^2 |S|) = O(\rho^4 n \log^4 m \log^6 n),$$

as long as $\delta \le 1/4$. To store each set in a canonical representation of $(S, \mathcal{F})$ only constant space is
required. Moreover, by Lemma 4.4, the space requirement of the second pass is $\tilde{O}(|\mathcal{F}_S|) = \tilde{O}(n)$.
Therefore, the total required space is $\tilde{O}(n)$ and the lemma follows. ■
algGeomSC(U, F, δ):
    for k ∈ {2^i | 0 ≤ i ≤ log n} do in parallel:        // n = |U|
        Let L ← U and sol ← ∅
        Repeat 1/δ times:
            for r ∈ F do                                  // Pass
                if |r ∩ L| ≥ |U|/k then
                    sol ← sol ∪ {r}
                    L ← L \ r
            S ← sample of L of size cρk(n/k)^δ log m log n
            F_S ← compCanonicalRep(S, F)                  // Pass
            sol_S ← algOfflineSC(S, F_S)
            for r ∈ F do                                  // Pass
                if ∃r' ∈ sol_S s.t. r' ∩ S ⊆ r ∩ S then
                    sol ← sol ∪ {r}
                    sol_S ← sol_S \ {r'}
                    L ← L \ r
        for r ∈ F do                                      // Final Pass
            if r ∩ L ≠ ∅ then
                sol ← sol ∪ {r}
                L ← L \ r
    return smallest sol computed in parallel

Figure 4.1: A streaming algorithm for the Points-Shapes Set Cover problem.
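
The first pass of an iteration and the final pass in Figure 4.1 are simple threshold scans, sketched below for one value of $k$. This is a partial illustration under assumptions: sets arrive as hypothetical `(name, members)` pairs with explicit member sets, and the sampling, canonical-representation, and offline-solver steps are omitted.

```python
def heavy_pass(stream, uncovered, threshold, solution):
    """First pass of an iteration: pick every set that still covers at
    least `threshold` (that is, |U|/k) uncovered elements."""
    for name, members in stream:
        if len(members & uncovered) >= threshold:
            solution.append(name)
            uncovered -= members
    return uncovered


def final_pass(stream, uncovered, solution):
    """Final pass: pick every set that covers some still-uncovered element."""
    for name, members in stream:
        if members & uncovered:
            solution.append(name)
            uncovered -= members
    return uncovered
```

Both scans keep only the set `uncovered` and the names of the picked sets, which is the source of the $\tilde{O}(n)$ term for these passes in the space analysis.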

Theorem 4.6. Consider a set system defined over a set $U$ of $n$ points in the plane and a set $\mathcal{F}$ of $m$
ranges (which are either all disks, all axis-parallel rectangles, or all fat triangles). Let $\rho$ be the quality
of approximation of the offline set-cover solver we have, and let $0 < \delta \le 1/4$ be an arbitrary
parameter.

Setting $\delta = 1/4$, the algorithm algGeomSC, depicted in Figure 4.1, with high probability
returns an $O(\rho)$-approximate solution of the optimal set cover solution for the instance $(U, \mathcal{F})$.
This algorithm uses $\tilde{O}(n)$ space and performs a constant number of passes over the data.

Proof: As before, consider the run of the algorithm in which $|OPT| \le k < 2|OPT|$. Let $V$ be the
set of uncovered elements $L$ at the beginning of the iteration, and note that the total number of
sets that are picked during the iteration is at most $(1 + c_1\rho)k$, where $c_1$ is the constant defined in
Definition 4.1. Let $\mathcal{G}$ denote all possible such covers, that is, $\mathcal{G} = \{\mathcal{F}' \subseteq \mathcal{F} \mid |\mathcal{F}'| \le (1 + c_1\rho)k\}$. Let
$\mathcal{H}$ be the collection that contains all possible sets of uncovered elements at the end of the iteration,
defined as $\mathcal{H} = \{V \setminus \bigcup_{r \in \mathcal{C}} r \mid \mathcal{C} \in \mathcal{G}\}$. Set $p = (k/n)^{\delta}$, $\epsilon = 1/2$ and $q = m^{-c}$. Since for large enough
$c$, $\frac{c'}{\epsilon^2 p}(\log |\mathcal{H}| \log \frac{1}{p} + \log \frac{1}{q}) \le c\rho k (n/k)^{\delta} \log m \log n = |S|$, with probability at least $1 - m^{-c}$, by
Lemma 2.5, the set of sampled elements $S$ is a relative $(p, \epsilon)$-approximation sample of $(V, \mathcal{H})$.

Let $\mathcal{C} \subseteq \mathcal{F}$ be the collection of sets picked in the third pass of the algorithm that covers all
elements in $S$. By Lemma 4.4, $|\mathcal{C}| \le c_1\rho k$ for some constant $c_1$. Since with high probability $S$
is a relative $(p, \epsilon)$-approximation sample of $(V, \mathcal{H})$, the number of uncovered elements of $V$ (or $L$)
after adding $\mathcal{C}$ to sol is at most $\epsilon p |V| \le |U|(k/n)^{\delta}$. Thus with probability at least $(1 - m^{-c})$, in
each iteration and by adding $O(\rho k)$ sets, the number of uncovered elements reduces by a factor of
$(n/k)^{\delta}$.

Therefore, after 4 iterations (for $\delta = 1/4$) the algorithm picks $O(\rho k)$ sets and with high probability
the number of uncovered elements is at most $n(k/n)^{4\delta} = k$. Thus, in the final pass the
algorithm only adds $k$ sets to the solution sol, and hence the approximation factor of the algorithm
is $O(\rho)$. ■
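
The geometric decrease used at the end of the proof can be traced numerically. The sketch below is only a worked example (the function name is hypothetical, and it assumes that $1/\delta$ is an integer): it lists the upper bounds on the number of uncovered elements after each iteration, each obtained from the previous one by the factor $(k/n)^{\delta}$.

```python
import math


def uncovered_bounds(n, k, delta):
    """Upper bounds n * (k/n)^(i * delta) for i = 0, ..., 1/delta."""
    iterations = round(1 / delta)
    factor = (k / n) ** delta
    return [n * factor ** i for i in range(iterations + 1)]
```

For $n = 65536$, $k = 16$ and $\delta = 1/4$ the factor is $1/8$, and the bounds are 65536, 8192, 1024, 128 and finally $16 = k$ after four iterations.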


Remark 4.7. The result of Theorem 4.6 is similar to the result of Agarwal and Pan [AP14], except
that their algorithm performs $O(\log n)$ iterations over the data, while the algorithm of Theorem 4.6
performs only a constant number of iterations. In particular, one can use the algorithm of Agarwal
and Pan [AP14] as the offline solver.


5 Lower bound for multipass algorithms 

In this section we give a lower bound on the memory space of multipass streaming algorithms for
the Set Cover problem. Our main result is an $\tilde{\Omega}(mn^{\delta})$ space lower bound for streaming algorithms that return an
optimal solution of the Set Cover problem in $O(1/\delta)$ passes for $m = O(n)$. Our approach is to
reduce the communication Intersection Set Chasing$(n, p)$ problem introduced by Guruswami and
Onak [GO13] to the communication Set Cover problem.

Consider a communication problem P with $n$ players $P_1, \dots, P_n$. The problem P is an $(n, r)$-communication
problem if the players communicate in $r$ rounds and in each round they speak in the order
$P_1, \dots, P_n$. At the end of the $r$th round, $P_n$ should return the solution. Moreover, we assume private
randomness and public messages. In what follows we define the communication Set Chasing and
Intersection Set Chasing problems.

Definition 5.1 (Communication Set Chasing Problem). The Set Chasing$(n, p)$ problem is a $(p, p-1)$
communication problem in which player $i$ has a function $f_i : [n] \to 2^{[n]}$ and the goal is to
compute $f_1(f_2(\cdots f_p(\{1\}) \cdots))$ where $f_i(S) = \bigcup_{s \in S} f_i(s)$. Figure 5.1(a) shows an instance of the
communication Set Chasing(4, 3).

Definition 5.2 (Communication Intersection Set Chasing). The Intersection Set Chasing$(n, p)$ is a
$(2p, p-1)$ communication problem in which the first $p$ players have an instance of the Set Chasing$(n, p)$
problem and the other $p$ players have another instance of the Set Chasing$(n, p)$ problem. The output
of Intersection Set Chasing$(n, p)$ is 1 if the solutions of the two instances of Set Chasing$(n, p)$
intersect and 0 otherwise. Figure 5.1(b) shows an instance of Intersection Set Chasing(4, 3).
The function $f_i$ of each player $P_i$ is specified by a set of directed edges from a copy of vertices
labeled $\{1, \dots, n\}$ to another copy of vertices labeled $\{1, \dots, n\}$.

The communication Set Chasing problem is a generalization of the well-known communication
Pointer Chasing problem, in which player $i$ has a function $f_i : [n] \to [n]$ and the goal is to compute
$f_1(f_2(\cdots f_p(1) \cdots))$.
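
The definitions above can be made concrete with a direct, non-distributed evaluation. The sketch below is only an illustration: each $f_i$ is represented as a hypothetical Python dict from a vertex to a set of vertices, and no communication is modeled. Pointer Chasing is the special case in which every image is a singleton.

```python
def apply_to_set(f, s):
    """f(S): the union of f(x) over all x in S."""
    out = set()
    for x in s:
        out |= f.get(x, set())
    return out


def set_chasing(functions):
    """Compute f_1(f_2(... f_p({1}) ...)) for functions = [f_1, ..., f_p]."""
    current = {1}
    for f in reversed(functions):  # f_p is applied first
        current = apply_to_set(f, current)
    return current


def intersection_set_chasing(first, second):
    """1 if the two Set Chasing solutions intersect, and 0 otherwise."""
    return 1 if set_chasing(first) & set_chasing(second) else 0
```

The difficulty of the communication problem comes from the speaking order: the player holding $f_1$ speaks first, while the evaluation has to start from $f_p$.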

[GO13] showed that any randomized protocol that solves Intersection Set Chasing$(n, p)$ with
error probability less than 1/10 requires $\Omega\left(\frac{n^{1+1/(2p)}}{p^{16} \log^{3/2} n}\right)$ bits of communication, where $n$ is sufficiently
large and $p \le \frac{\log n}{\log \log n}$. In Theorem 5.4, we reduce the communication Intersection Set Chasing
problem to the communication Set Cover problem and then give the first superlinear memory lower
bound for the streaming Set Cover problem.

Figure 5.1: (a) shows an example of the communication Set Chasing(4, 3) and (b) is an instance of
the communication Intersection Set Chasing(4, 3).


Definition 5.3 (Communication Set Cover$(U, \mathcal{F}, p)$ Problem). The communication Set Cover$(U, \mathcal{F}, p)$ is
a $(p, p-1)$ communication problem in which a collection of elements $U$ is given to all players and
each player $i$ has a collection $\mathcal{F}_i$ of subsets of $U$. The goal is to solve Set Cover$(U, \mathcal{F}_1 \cup \dots \cup \mathcal{F}_p)$
using the minimum number of communication bits.

Theorem 5.4. Any $(1/2\delta - 1)$-pass streaming algorithm that solves Set Cover$(U, \mathcal{F})$ optimally
with constant probability of error requires $\tilde{\Omega}(mn^{\delta})$ memory space, where $\delta \ge \frac{\log \log n}{\log n}$ and $m = O(n)$.

Consider an instance ISC of the communication Intersection Set Chasing$(n, p)$. We construct an
instance of the communication Set Cover$(U, \mathcal{F}, 2p)$ problem such that solving Set Cover$(U, \mathcal{F})$ optimally
determines whether the output of ISC is 1 or not.

The instance ISC consists of $2p$ players. Each player $1, \dots, p$ has a function $f_i : [n] \to 2^{[n]}$ and
each player $p+1, \dots, 2p$ has a function $f'_i : [n] \to 2^{[n]}$ (see Figure 5.1). In ISC, each function $f_i$ is
shown by a set of vertices $v_i^1, \dots, v_i^n$ and $v_{i+1}^1, \dots, v_{i+1}^n$ such that there is a directed edge from $v_{i+1}^j$
to $v_i^{\ell}$ if and only if $\ell \in f_i(j)$. Similarly, each function $f'_i$ is denoted by a set of vertices $u_i^1, \dots, u_i^n$
and $u_{i+1}^1, \dots, u_{i+1}^n$ such that there is a directed edge from $u_{i+1}^j$ to $u_i^{\ell}$ if and only if $\ell \in f'_i(j)$ (see
Figure 5.2(a) and Figure 5.2(b)).

In the corresponding communication Set Cover instance of ISC, we add two elements in$(v_i^j)$ and
out$(v_i^j)$ for each vertex $v_i^j$ where $i \le p+1$, $j \le n$. We also add two elements in$(u_i^j)$ and out$(u_i^j)$
for each vertex $u_i^j$ where $i \le p+1$, $j \le n$. In addition to these elements, for each player $i$, we add
an element $e_i$ (see Figure 5.2(c) and Figure 5.2(d)).

Next, we define a collection of sets in the corresponding Set Cover instance of ISC. For each
player $P_i$, where $1 \le i \le p$, we add a single set $S_i^j$ containing out$(v_{i+1}^j)$ and in$(v_i^{\ell})$ for all out-going
edges $(v_{i+1}^j, v_i^{\ell})$. Moreover, all $S_i^j$ sets contain the element $e_i$. Next, for each vertex $v_i^j$ we add a set
$R_i^j$ that contains the two corresponding elements of $v_i^j$, in$(v_i^j)$ and out$(v_i^j)$. In Figure 5.2(c), the
red rectangles denote $R$-type sets and the curves denote $S$-type sets for the first half of the players.

Similarly to the sets corresponding to players 1 to $p$, for each player $P_{p+i}$ where $1 \le i \le p$, we
add a set $S_{p+i}^j$ containing in$(u_i^j)$ and out$(u_{i+1}^{\ell})$ for all in-coming edges $(u_{i+1}^{\ell}, u_i^j)$ of $u_i^j$ (denoting
$f_i'^{-1}(j)$). The set $S_{p+i}^j$ contains the element $e_{p+i}$ too. Next, for each vertex $u_i^j$ we add a set $T_i^j$
that contains the two corresponding elements of $u_i^j$, in$(u_i^j)$ and out$(u_i^j)$. In Figure 5.2(d), the red
rectangles denote $T$-type sets and the curves denote $S$-type sets for the second half of the players.

At the end, we merge the vertices $v_1^j$ and $u_1^j$ as shown in Figure 5.3. After merging the corresponding
sets of the $v_1^j$s ($R_1^1, \dots, R_1^n$) and the corresponding sets of the $u_1^j$s ($T_1^1, \dots, T_1^n$), we call the merged sets
$T_1^1, \dots, T_1^n$.

Figure 5.2: The gadgets used in the reduction of the communication Intersection Set Chasing
problem to the communication Set Cover problem. (a) and (c) show the construction of the
gadget for players 1 to $p$, and (b) and (d) show the construction of the gadget for players $p+1$ to
$2p$.


The main claim is that if the solution of ISC is 1 then the size of an optimal solution of its
corresponding Set Cover instance SC is $(2p+1)n + 1$; otherwise, it is $(2p+1)n + 2$.

Lemma 5.5. The size of any feasible solution of SC is at least $(2p+1)n + 1$.

Proof: For each player $i$ ($1 \le i < p$), since the elements out$(v_{i+1}^j)$ are only covered by $R_{i+1}^j$ and $S_i^j$, at least $n$
sets are required to cover out$(v_{i+1}^1), \dots,$ out$(v_{i+1}^n)$. Moreover, for player $P_p$, since the elements in$(v_{p+1}^j)$ are only
covered by $R_{p+1}^j$ and $e_p$ is only covered by $S_p$, all $n+1$ sets $R_{p+1}^1, \dots, R_{p+1}^n, S_p$ must be selected
in any feasible solution of SC.

Similarly, for each player $p+i$ ($1 \le i \le p$), since the elements in$(u_i^j)$ are only covered by $T_i^j$ and $S_{p+i}^j$, at
least $n$ sets are required to cover in$(u_i^1), \dots,$ in$(u_i^n)$. Moreover, considering $u_{p+1}^1, \dots, u_{p+1}^n$, since
in$(u_{p+1}^j)$ is only covered by $T_{p+1}^j$, all $n$ sets $T_{p+1}^1, \dots, T_{p+1}^n$ must be selected in any feasible solution
of SC.

All together, at least $(2p+1)n + 1$ sets should be selected in any feasible solution of SC. ■
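
The count in the proof can be tallied group by group. The sketch below is one reading of that tally and rests on assumptions: the function name is hypothetical, and the grouping (players $1, \dots, p-1$, then player $P_p$, then players $p+1, \dots, 2p$, then the last layer of $T$ sets) is inferred from the total $(2p+1)n + 1$.

```python
def forced_sets(n, p):
    """Number of sets that any feasible solution of SC must contain."""
    first_half = (p - 1) * n   # players 1..p-1: n sets each for the out elements
    player_p = n + 1           # all n sets R_{p+1}^j, plus the set S_p
    second_half = p * n        # players p+1..2p: n sets each for the in elements
    last_layer = n             # all n sets T_{p+1}^j
    return first_half + player_p + second_half + last_layer
```

For instance, `forced_sets(4, 3)` gives $8 + 5 + 12 + 4 = 29 = (2 \cdot 3 + 1) \cdot 4 + 1$.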


Lemma 5.6. Suppose that the solution of ISC is 1. Then the size of an optimal solution of its
corresponding Set Cover instance is exactly $(2p+1)n + 1$.

Proof: By Lemma 5.5, the size of an optimal solution of SC is at least $(2p+1)n + 1$. Here we prove that
$(2p+1)n + 1$ sets suffice when the solution of ISC is 1. Let $Q = v_{p+1}^{1}, v_p^{j_p}, \dots, v_1^{j_1}, u_1^{\ell_1}, \dots, u_p^{\ell_p}, u_{p+1}^{1}$
be a path in ISC such that $j_1 = \ell_1$ (since the solution of ISC is 1, such a path exists). The corresponding
solution to $Q$ can be constructed as follows (see Figure 5.4):

• Pick $S_p$ and all $R_{p+1}^j$s ($n + 1$ sets).

• For each $v_i^{j_i}$ in $Q$ where $1 < i \le p$, pick the set $S_{i-1}^{j_i}$ in the solution. Moreover, for each such
$i$ pick all sets $R_i^j$ where $j \ne j_i$ ($n(p-1)$ sets).

• For $v_1^{j_1}$ (or $u_1^{\ell_1}$) pick the set $S_{p+1}^{j_1}$. Moreover, pick all sets $T_1^j$ where $j \ne j_1$ ($n$ sets).

• For each $u_i^{\ell_i}$ in $Q$ where $1 < i \le p$, pick the set $S_{p+i}^{\ell_i}$ in the solution. Moreover, for each such
$i$ pick all sets $T_i^j$ where $j \ne \ell_i$ ($n(p-1)$ sets).

• Pick all $T_{p+1}^j$s ($n$ sets).

It is straightforward to see that the solution constructed above is a feasible solution. ■

Figure 5.3: In (b) two Set Chasing instances merge in their first set of vertices and (c) shows the
corresponding gadgets of these merged vertices in the communication Set Cover.

Figure 5.4: In (a), path $Q$ is shown with black dashed arcs and (b) shows the corresponding cover
of path $Q$.

Lemma 5.7. Suppose that the size of an optimal solution of the corresponding Set Cover instance
of ISC, SC, is $(2p+1)n + 1$. Then the solution of ISC is 1.

Proof: As we proved earlier in Lemma 5.5, any feasible solution of SC picks $R_{p+1}^1, \dots, R_{p+1}^n, S_p$ and
$T_{p+1}^1, \dots, T_{p+1}^n$. Moreover, we proved that for each $1 \le i < p$, at least $n$ sets should be selected
from $R_{i+1}^1, \dots, R_{i+1}^n, S_i^1, \dots, S_i^n$. Similarly, for each $1 \le i \le p$, at least $n$ sets should be selected
from $T_i^1, \dots, T_i^n, S_{p+i}^1, \dots, S_{p+i}^n$. Thus if a feasible solution of SC, OPT, is of size $(2p+1)n + 1$,
it has exactly $n$ sets from each specified group.

Next we consider the first half of the players and the second half of the players separately. Consider
$i$ such that $1 \le i < p$. Let $S_i^{j_1}, \dots, S_i^{j_k}$ be the sets picked in the optimal solution (because of $e_i$ there
should be at least one set of the form $S_i^j$ in OPT). Since each out$(v_{i+1}^j)$ is only covered by $S_i^j$ and $R_{i+1}^j$,
for all $j \notin \{j_1, \dots, j_k\}$, $R_{i+1}^j$ should be selected in OPT. Moreover, for all $j \in \{j_1, \dots, j_k\}$, $R_{i+1}^j$
should not be contained in OPT (otherwise the size of OPT would be larger than $(2p+1)n + 1$).
Consider $j \in \{j_1, \dots, j_k\}$. Since $R_{i+1}^j$ is not in OPT, there should be a set $S_{i+1}^{\ell}$ selected in OPT
such that in$(v_{i+1}^j)$ is contained in $S_{i+1}^{\ell}$. Thus, by considering the $S_i$s in decreasing order of $i$ and using
induction, if $S_i^j$ is in OPT then $v_{i+1}^j$ is reachable from $v_{p+1}^1$.

Next consider a set $S_{p+i}^j$ that is selected in OPT ($1 \le i \le p$). By a similar argument, $T_i^j$ is not in
OPT and there exists a set $S_{p+i-1}^{\ell}$ (or $S_1^{\ell}$ if $i = 1$) in OPT such that out$(u_i^j)$ is contained in it.
Let $u_{i+1}^{\ell_1}, \dots, u_{i+1}^{\ell_k}$ be the set of vertices whose corresponding out elements are in $S_{p+i}^j$. Then by
induction, there exists an index $r$ such that $u_1^r$ is reachable from $v_{p+1}^1$ and is also reachable
from all $u_{i+1}^{\ell_1}, \dots, u_{i+1}^{\ell_k}$. Moreover, the way we constructed the instance SC guarantees that all sets
$S_{2p}^1, \dots, S_{2p}^n$ contain out$(u_{p+1}^1)$. Hence if the size of an optimal solution of SC is $(2p+1)n + 1$ then
the solution of ISC is 1. ■

Corollary 5.8. Intersection Set Chasing$(n, p)$ returns 1 if and only if the size of an optimal solution
of its corresponding Set Cover instance (as described here) is $(2p+1)n + 1$.

Observation 5.9. Any streaming algorithm $X$ for Set Cover that in $\ell$ passes solves the problem
optimally with a probability of error err and consumes $s$ memory space solves the corresponding
communication Set Cover problem in $\ell$ rounds using $O(s\ell^2)$ bits of communication with probability
of error err.

Proof: Starting from player $P_1$, each player runs $X$ over its input sets, and once $P_i$ is done with its
input, she sends the working memory of $X$ publicly to the other players. Then the next player starts the
same routine using the state of the working memory received from the previous player. Since $X$
solves the Set Cover instance optimally after $\ell$ passes using $O(s)$ space with probability of error err,
applying $X$ as a black box we can solve P in $\ell$ rounds using $O(s\ell^2)$ bits of communication with
probability of error err. ■
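
The simulation in this proof can be sketched as a loop that hands the working memory from player to player. It is only an illustration under assumptions: the streaming algorithm is abstracted as a hypothetical `process(state, item)` update, and the sketch counts messages rather than bits.

```python
def run_as_protocol(player_inputs, passes, init_state, process):
    """Simulate a multi-pass streaming algorithm as a multi-round protocol.

    In every pass each player feeds its own sets to the algorithm and then
    publishes the working memory for the next player.
    """
    state = init_state
    messages = 0
    for _ in range(passes):
        for sets in player_inputs:
            for item in sets:
                state = process(state, item)
            messages += 1  # the working memory is handed to the next player
    return state, messages
```

If the working memory has $O(s)$ bits, the total communication is at most the number of messages times $O(s)$; with $O(\ell)$ players and $\ell$ passes this gives the $O(s\ell^2)$ bound of Observation 5.9.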

Proof of Theorem 5.4: By Observation 5.9, any $\ell$-pass $O(s)$-space algorithm that solves streaming
Set Cover$(U, \mathcal{F})$ optimally can be used to solve the communication Set Cover$(U, \mathcal{F}, p)$ problem in $\ell$
rounds using $O(s\ell^2)$ bits of communication. Moreover, by Corollary 5.8, we can decide the solution
of the communication Intersection Set Chasing$(n, p)$ by solving its corresponding communication Set
Cover problem. Note that while working with the corresponding Set Cover instance of Intersection
Set Chasing$(n, p)$, all players know the collection of elements $U$ and each player can construct its
collection of sets $\mathcal{F}_i$ using $f_i$ (or $f'_i$).

However, by a result of [GO13], we know that any protocol that solves the communication Intersection
Set Chasing$(n, p)$ problem with probability of error less than 1/10 requires $\Omega\left(\frac{n^{1+1/(2p)}}{p^{16} \log^{3/2} n}\right)$
bits of communication. Since in the corresponding Set Cover instance of the communication Intersection
Set Chasing$(n, p)$, $|U| = (2p+1) \times 2n + 2p = O(np)$ and $|\mathcal{F}| \le (2p+1)n + 2pn = O(np)$, any
$(p-1)$-pass streaming algorithm that solves the Set Cover problem optimally with a probability
of error at most 1/10 requires $\Omega\left(\frac{n^{1+1/(2p)}}{p^{16} \log^{3/2} n}\right)$ bits of communication. Then, using Observation 5.9,
since $\delta \ge \frac{\log \log n}{\log n}$, any $(1/2\delta - 1)$-pass streaming algorithm for Set Cover that finds an optimal solution
with error probability less than 1/10 requires $\tilde{\Omega}(|\mathcal{F}| \cdot |U|^{\delta})$ space. ■


6 Lower Bound for Sparse Set Cover in Multiple Passes

In this part we give a stronger lower bound for the instances of the streaming Set Cover problem
with sparse input sets. An instance of the Set Cover problem is $s$-Sparse Set Cover if for each set
$r \in \mathcal{F}$ we have $|r| \le s$. We can use the same reduction approach described earlier in Section 5 to show
that any $(1/2\delta - 1)$-pass streaming algorithm for $s$-Sparse Set Cover requires $\tilde{\Omega}(|\mathcal{F}| s)$ memory space
if $s \le |U|^{\delta}$ and $|\mathcal{F}| = O(|U|)$. To prove this, we need to explain more details of the approach of [GO13]
to the lower bound for the communication Intersection Set Chasing problem. They first obtained a
lower bound for the Equal Pointer Chasing$(n, p)$ problem, in which two instances of the communication
Pointer Chasing$(n, p)$ are given and the goal is to decide whether these two instances point to the
same value or not, i.e., whether $f_1(f_2(\cdots f_p(1) \cdots)) = f'_1(f'_2(\cdots f'_p(1) \cdots))$.

Definition 6.1 ($r$-non-injective functions). A function $f : [n] \to [n]$ is called $r$-non-injective if there
exist $A \subseteq [n]$ of size at least $r$ and $b \in [n]$ such that for all $a \in A$, $f(a) = b$.
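
Definition 6.1 amounts to asking whether some value has at least $r$ preimages. A minimal sketch, assuming that the function is given as a Python dict from $[n]$ to $[n]$ (the function name is hypothetical):

```python
from collections import Counter


def is_r_non_injective(f, r):
    """True if some value b has at least r preimages under f."""
    preimage_sizes = Counter(f.values())
    return any(size >= r for size in preimage_sizes.values())
```

An injective function is therefore not $r$-non-injective for any $r \ge 2$.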

Definition 6.2 (Pointer Chasing Problem). Pointer Chasrng(n,p) is a (p, p—l) communication prob¬ 
lem in which the player i has a function /j : [n] —)• [re] and the goal is to compute /i(/ 2 (- • • /p(l) • • • ))• 

Definition 6.3 (Equal Limited Pointer Chasing Problem). Equal Pointer Chasing$(n, p)$ is a $(2p, p-1)$
communication problem in which the first $p$ players have an instance of the Pointer Chasing$(n, p)$
problem and the other $p$ players have another instance of the Pointer Chasing$(n, p)$ problem. The
output of Equal Pointer Chasing$(n, p)$ is 1 if the solutions of the two instances of Pointer
Chasing$(n, p)$ have the same value and 0 otherwise. Furthermore, in another variant of the pointer
chasing problem, Equal Limited Pointer Chasing$(n, p, r)$, if any of the functions $f_i$ is $r$-non-injective,
then the output is 1. Otherwise, the output is the same as that of Equal Pointer Chasing$(n, p)$.
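
The outputs specified in Definitions 6.2 and 6.3 can be sketched as follows. This is only a centralized evaluation of the functions being computed; it ignores the communication constraints (who speaks when, and how many rounds are allowed) that make the problems hard. All function names, the list encoding, and the use of 0 as the starting pointer (in place of 1) are assumptions made for this illustration.

```python
from collections import Counter

def pointer_chasing(fs, start=0):
    """Evaluate f_1(f_2(... f_p(start) ...)) for fs = [f_1, ..., f_p].

    Each f_i is a list encoding a function [n] -> [n] (0-indexed), so the
    innermost function f_p is applied first.
    """
    value = start
    for f in reversed(fs):
        value = f[value]
    return value

def is_r_non_injective(f, r):
    """True if some value has at least r preimages under f."""
    return bool(f) and max(Counter(f).values()) >= r

def equal_pointer_chasing(fs, gs):
    """Output 1 iff the two Pointer Chasing instances reach the same value."""
    return int(pointer_chasing(fs) == pointer_chasing(gs))

def equal_limited_pointer_chasing(fs, gs, r):
    """Output 1 if any function is r-non-injective; otherwise defer to
    equal_pointer_chasing."""
    if any(is_r_non_injective(f, r) for f in fs + gs):
        return 1
    return equal_pointer_chasing(fs, gs)

# Toy instances with n = 3 and p = 2 (hypothetical data).
fs = [[1, 2, 0], [2, 0, 1]]   # f_2(0) = 2, then f_1(2) = 0
hs = [[1, 1, 0], [1, 2, 0]]   # h_2(0) = 1, then h_1(1) = 1; h_1 is 2-non-injective
print(equal_pointer_chasing(fs, hs))             # 0: the two chains end at 0 and 1
print(equal_limited_pointer_chasing(fs, hs, 2))  # 1: forced by the non-injective h_1
```

The second call shows the only difference between the two variants: the limited version outputs 1 whenever the promise fails, regardless of where the pointers end up.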

For a boolean communication problem $P$, OR$_t(P)$ is defined to be the OR of $t$ instances of $P$: the
output of OR$_t(P)$ is true if and only if the output of at least one of the $t$ instances is true. Using a direct
sum argument, [GO13] showed that the communication complexity of OR$_t$(Equal Limited Pointer
Chasing$(n, p, r)$) is $t$ times the communication complexity of Equal Limited Pointer Chasing$(n, p, r)$.


Lemma 6.4 ([GO13]). Let $n$, $p$, $t$ and $r$ be positive integers such that $n \ge 5p$, $t \le n/4$ and
$r = O(\log n)$. Then the number of bits of communication needed to solve OR$_t$(Equal Limited Pointer
Chasing$(n, p, r)$) with error probability less than 1/3 is $\Omega\left(\frac{tn}{p^{16} \log n}\right) - O(pt^2)$.

Lemma 6.5 ([GO13]). Let $n$, $p$, $t$ and $r$ be positive integers such that $t^{2p} r^{p-1} < n/10$. Then
if there is a protocol that solves Intersection Set Chasing$(n, p)$ with probability of error less than
1/10 using $C$ bits of communication, there is a protocol that solves OR$_t$(Equal Limited Pointer
Chasing$(n, p, r)$) with probability of error at most 2/10 using $C + 2p$ bits of communication.

Consider an instance of OR$_t$(Equal Limited Pointer Chasing$(n, p, r)$) in which $t \le n^{\delta}$, $r = \log n$, and
$p = \frac{1}{2\delta} - 1$, where $\frac{1}{\delta} = o(\log n)$. By Lemma 6.4, the number of bits of communication required to solve
the instance with constant success probability is $\tilde{\Omega}(tn)$. Then, applying Lemma 6.5, solving the
corresponding Intersection Set Chasing instance requires $\tilde{\Omega}(tn)$ bits of communication.
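
The step from Lemma 6.4 to a bound of order $tn$ up to polylogarithmic factors can be spelled out as follows. This is a sketch under two assumptions that are not stated explicitly in the text: that the bound in Lemma 6.4 has the form $\Omega(tn/(p^{16}\log n)) - O(pt^2)$, and that $p \ge 1$, which forces $\delta \le 1/4$.

```latex
\begin{align*}
t \le n^{\delta} \le n^{1/4}
  &\;\Longrightarrow\; p\,t^{2} \le p\,t\,n^{1/4}, \\
\frac{tn/(p^{16}\log n)}{p\,t\,n^{1/4}}
  &= \frac{n^{3/4}}{p^{17}\log n} \;\longrightarrow\; \infty
  \quad\text{since } p < \tfrac{1}{2\delta} = o(\log n), \\
\Omega\!\left(\frac{tn}{p^{16}\log n}\right) - O(p\,t^{2})
  &= \Omega\!\left(\frac{tn}{p^{16}\log n}\right)
   = \tilde{\Omega}(tn).
\end{align*}
```

The subtracted term is therefore of lower order, and the remaining factor $p^{16}\log n$ is polylogarithmic in $n$, which is what the $\tilde{\Omega}$ notation absorbs.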

In the reduction from OR$_t$(Equal Limited Pointer Chasing$(n, p, r)$) to Intersection Set Chasing$(n, p)$
(proof of Lemma 6.5), the $r$-non-injective property is preserved. In other words, in the corresponding
Intersection Set Chasing instance each player's function $f_i : [n] \to 2^{[n]}$ is the union of $t$
functions, $f_i(a) := f_{i,1}(a) \cup \cdots \cup f_{i,t}(a)$. Given that none of the functions $f_{i,j}$ is
$r$-non-injective, the corresponding Set Cover instance will have sets of size at most $rt$ ($S$-type sets
are of size at most $t$ for $1 \le i \le p$ and of size at most $rt$ for $p+1 \le i \le 2p$). Since $r = O(\log n)$,
the corresponding Set Cover instance is $\tilde{O}(t)$-sparse. As we showed earlier in the reduction from
Intersection Set Chasing to Set Cover, the number of elements (and sets) in the corresponding
Set Cover instance is $O(np)$. Thus we have the following result for the $s$-Sparse Set Cover problem.
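
The two size bounds used above can be checked on a toy example. The sketch below builds the union of $t$ functions and measures the largest image and the largest preimage. The helper names and the data are hypothetical, and how these quantities map onto the $S$-type sets depends on the reduction of Section 5, which is not reproduced here.

```python
from collections import Counter

def union_function(parts):
    """Combine t functions [n] -> [n] into one function [n] -> 2^[n].

    parts is a list of t lists; parts[j][a] is the value of the j-th
    function on input a (0-indexed).
    """
    n = len(parts[0])
    return [{f[a] for f in parts} for a in range(n)]

def max_image_size(F):
    """Largest |F(a)| over all inputs a."""
    return max(len(image) for image in F)

def max_preimage_size(F):
    """Largest number of inputs a whose image F(a) contains a fixed value b."""
    counts = Counter(b for image in F for b in image)
    return max(counts.values())

# Hypothetical toy instance: n = 6, t = 2, and neither part maps r = 3 or
# more inputs to the same value, so no part is 3-non-injective.
parts = [
    [0, 0, 1, 1, 2, 2],
    [3, 4, 3, 4, 5, 5],
]
t, r = len(parts), 3
F = union_function(parts)
assert max_image_size(F) <= t          # each F(a) is a union of t singletons
assert max_preimage_size(F) <= r * t   # each part contributes fewer than r inputs
```

Each image is a union of $t$ single values, which gives the bound $t$, while each value can be hit by fewer than $r$ inputs per part, which gives the bound $rt$.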

Theorem 6.6. For $s \le |U|^{\delta}$, any streaming algorithm that solves $s$-Sparse Set Cover$(U, \mathcal{F})$ optimally
with probability of error less than 1/10 in $(\frac{1}{2\delta} - 1)$ passes requires $\tilde{\Omega}(|\mathcal{F}| s)$ memory space for
$|\mathcal{F}| = O(|U|)$.